1 00:00:00,000 --> 00:00:00,120 2 00:00:00,120 --> 00:00:02,500 The following content is provided under a Creative 3 00:00:02,500 --> 00:00:03,910 Commons license. 4 00:00:03,910 --> 00:00:06,950 Your support will help MIT OpenCourseWare continue to 5 00:00:06,950 --> 00:00:10,600 offer high quality educational resources for free. 6 00:00:10,600 --> 00:00:13,500 To make a donation or view additional materials from 7 00:00:13,500 --> 00:00:17,430 hundreds of MIT courses, visit MIT OpenCourseWare at 8 00:00:17,430 --> 00:00:18,680 ocw.mit.edu. 9 00:00:18,680 --> 00:00:23,660 10 00:00:23,660 --> 00:00:25,450 PROFESSOR: Let's get started. 11 00:00:25,450 --> 00:00:29,750 So today, I'm going to talk a little bit more about 12 00:00:29,750 --> 00:00:33,200 performance issues in parallelization. 13 00:00:33,200 --> 00:00:36,860 A little bit more out of the [INAUDIBLE] 14 00:00:36,860 --> 00:00:39,170 to what people are doing otherwise. 15 00:00:39,170 --> 00:00:44,030 So normally what we have done so far, is we looked at Cilk. 16 00:00:44,030 --> 00:00:46,100 It provides a very robust and environment for 17 00:00:46,100 --> 00:00:46,930 parallelization. 18 00:00:46,930 --> 00:00:50,550 It hides many issues and eliminates many of the 19 00:00:50,550 --> 00:00:56,120 problems out there if you find other areas of parallelization 20 00:00:56,120 --> 00:00:57,710 that you deal with. 21 00:00:57,710 --> 00:01:00,290 And in last lectures, we looked at things like cache 22 00:01:00,290 --> 00:01:01,140 [UNINTELLIGIBLE] 23 00:01:01,140 --> 00:01:04,269 algorithims, algorithmic issues, like 24 00:01:04,269 --> 00:01:06,710 looking at work and spend. 25 00:01:06,710 --> 00:01:08,790 And in fact, in your projects, you're going to use all these 26 00:01:08,790 --> 00:01:13,660 nice concepts to get you a nice parallel learning CorD. 27 00:01:13,660 --> 00:01:20,640 But if you look at a lot of these CorD [UNINTELLIGIBLE] 28 00:01:20,640 --> 00:01:24,150 the very people normally for parallelized CorD outside 29 00:01:24,150 --> 00:01:25,800 probably Cilk. 30 00:01:25,800 --> 00:01:30,270 And there are a lot of other issues that arise, things like 31 00:01:30,270 --> 00:01:33,160 synchronization issues and memory issues. 32 00:01:33,160 --> 00:01:36,380 Today, I think we are going to focus mostly on memory issues. 33 00:01:36,380 --> 00:01:39,450 And we are going to use open OpenMP instead of [INAUDIBLE]. 34 00:01:39,450 --> 00:01:42,400 And most of these issues will be affected on 35 00:01:42,400 --> 00:01:43,710 Cilk sometimes, too. 36 00:01:43,710 --> 00:01:45,990 But Cilk tries to hide them from you. 37 00:01:45,990 --> 00:01:46,920 There's a layer of abstract. 38 00:01:46,920 --> 00:01:49,940 And it's hard to kind of get to those issues in there. 39 00:01:49,940 --> 00:01:53,200 So we are going to look at this thing called OpenMP. 40 00:01:53,200 --> 00:01:55,430 So today, we are going to address things like 41 00:01:55,430 --> 00:01:58,820 granularity of parallelism. 42 00:01:58,820 --> 00:02:00,140 There are so many things that just went out 43 00:02:00,140 --> 00:02:01,820 on the page, I guess. 44 00:02:01,820 --> 00:02:05,060 True sharing, false sharing, load balancing issues, the 45 00:02:05,060 --> 00:02:06,170 [UNINTELLIGIBLE]. 46 00:02:06,170 --> 00:02:10,570 So from the license and keep talking about that we want to 47 00:02:10,570 --> 00:02:13,550 be out of all this not dealing with Voodoo parameter. 48 00:02:13,550 --> 00:02:16,350 Today, we actually are dealing mainly with Voodoo. 49 00:02:16,350 --> 00:02:20,460 So I guess this should be basically 50 00:02:20,460 --> 00:02:21,670 the Halloween lecture. 51 00:02:21,670 --> 00:02:24,703 So we are all about Voodoo today and see how we can deal 52 00:02:24,703 --> 00:02:27,760 with Voodoo issues. 53 00:02:27,760 --> 00:02:32,690 So if you look at a Cilk program, here is a nice simple 54 00:02:32,690 --> 00:02:34,740 matrix multiply, seem to be [INAUDIBLE] 55 00:02:34,740 --> 00:02:36,320 example these days. 56 00:02:36,320 --> 00:02:39,060 What you can do is you can put a Cilk formula in these two 57 00:02:39,060 --> 00:02:42,070 loops and get a nice parallel performance. 58 00:02:42,070 --> 00:02:44,870 However, [UNINTELLIGIBLE] 59 00:02:44,870 --> 00:02:47,270 from where how the memory is arranged is 60 00:02:47,270 --> 00:02:48,820 up to the Cilk scheduler. 61 00:02:48,820 --> 00:02:50,980 Cilk scheduler is doing some work stealing. 62 00:02:50,980 --> 00:02:54,420 Depending on how the work gets distributed, the process will 63 00:02:54,420 --> 00:02:56,090 get worked, it will happen. 64 00:02:56,090 --> 00:02:58,370 Hopefully, everything will go nicely. 65 00:02:58,370 --> 00:03:02,270 And so what that means is it ties the distribution and load 66 00:03:02,270 --> 00:03:04,060 balancing issues. 67 00:03:04,060 --> 00:03:07,640 So it's nice if you have access to Cilk, but many other 68 00:03:07,640 --> 00:03:09,050 [UNINTELLIGIBLE] you might not be. 69 00:03:09,050 --> 00:03:12,320 And even within Cilk, some of these issues might show up. 70 00:03:12,320 --> 00:03:15,390 So what we are going to do is step one 71 00:03:15,390 --> 00:03:17,820 below the Cilk scheduler. 72 00:03:17,820 --> 00:03:22,470 So there's this system called OpenMP. 73 00:03:22,470 --> 00:03:25,640 It's a more simplified model of parallelism. 74 00:03:25,640 --> 00:03:29,665 So what it tries to do is instead of giving this very 75 00:03:29,665 --> 00:03:30,460 [UNINTELLIGIBLE] 76 00:03:30,460 --> 00:03:33,930 system, it lets you basically direct access to the 77 00:03:33,930 --> 00:03:35,110 processors. 78 00:03:35,110 --> 00:03:38,280 So what that means is there's normally what we call a 79 00:03:38,280 --> 00:03:39,210 fork-join model. 80 00:03:39,210 --> 00:03:39,520 [UNINTELLIGIBLE] 81 00:03:39,520 --> 00:03:43,580 we have with Cilk, basically. 82 00:03:43,580 --> 00:03:46,490 We can do fork into different workers and join. 83 00:03:46,490 --> 00:03:49,750 And more or less, you can actually bind these workers to 84 00:03:49,750 --> 00:03:50,420 [UNINTELLIGIBLE] 85 00:03:50,420 --> 00:03:53,510 sometimes or make sure that the number-- 86 00:03:53,510 --> 00:03:56,800 I'll give you some techniques how to do that as I go on. 87 00:03:56,800 --> 00:04:01,540 So for parallel loops, you can do data parallelism, different 88 00:04:01,540 --> 00:04:02,340 [UNINTELLIGIBLE] 89 00:04:02,340 --> 00:04:04,580 parallelism you can do something like fork-join. 90 00:04:04,580 --> 00:04:07,620 And you can see a bunch of static or 91 00:04:07,620 --> 00:04:11,220 dynamic scheduling policies. 92 00:04:11,220 --> 00:04:16,790 So for example in OpenMP, you can see for this loop that add 93 00:04:16,790 --> 00:04:17,959 a pragma to [UNINTELLIGIBLE] 94 00:04:17,959 --> 00:04:21,029 in front of this loop and say this is OpenMP. 95 00:04:21,029 --> 00:04:22,450 Parallel loop in here. 96 00:04:22,450 --> 00:04:24,310 Parallel full loop in this one. 97 00:04:24,310 --> 00:04:26,990 And schedule it using static chunk. 98 00:04:26,990 --> 00:04:29,610 I will tell you what exactly that means. 99 00:04:29,610 --> 00:04:34,680 And that gives you direct access to how each of these 100 00:04:34,680 --> 00:04:36,730 parts will be run. 101 00:04:36,730 --> 00:04:39,320 So let me get a little bit in detail. 102 00:04:39,320 --> 00:04:41,730 So assume you have [UNINTELLIGIBLE] courses in 103 00:04:41,730 --> 00:04:42,950 there, [UNINTELLIGIBLE] processors. 104 00:04:42,950 --> 00:04:46,270 So now in OpenMP, you are basically opening the entire 105 00:04:46,270 --> 00:04:46,990 world underneath. 106 00:04:46,990 --> 00:04:49,840 And you have to kind of see what's going on. 107 00:04:49,840 --> 00:04:55,330 And if you say, schedule a static chunk of four, assume 108 00:04:55,330 --> 00:04:58,050 you have 16 iterations. 109 00:04:58,050 --> 00:05:00,200 Here are my 16 iterations. 110 00:05:00,200 --> 00:05:02,990 So each of these dots represent a value for i. 111 00:05:02,990 --> 00:05:07,890 So what it says is you take chunks of four and basically 112 00:05:07,890 --> 00:05:08,950 send it across it. 113 00:05:08,950 --> 00:05:11,190 So what happens is the first four iterations will go to 114 00:05:11,190 --> 00:05:13,240 [UNINTELLIGIBLE] or core zero. 115 00:05:13,240 --> 00:05:15,650 Next four will go to core one, core two and core three. 116 00:05:15,650 --> 00:05:18,870 So you know exactly which iterations run where. 117 00:05:18,870 --> 00:05:20,660 It's a very static thing. 118 00:05:20,660 --> 00:05:23,100 You have full control of what's going on. 119 00:05:23,100 --> 00:05:26,980 Whereas in Cilk, it's up to the scheduler. 120 00:05:26,980 --> 00:05:29,090 So the nice thing here is you can have full control. 121 00:05:29,090 --> 00:05:30,960 But you get enough room to harm yourself if 122 00:05:30,960 --> 00:05:32,260 you do things wrong. 123 00:05:32,260 --> 00:05:36,510 So this is a double-edged sword in that sense. 124 00:05:36,510 --> 00:05:39,990 So instead of doing static five you do static two. 125 00:05:39,990 --> 00:05:43,010 You're assigning chunks of size two. 126 00:05:43,010 --> 00:05:46,125 What it will do is it will assign chunks of size two to 127 00:05:46,125 --> 00:05:48,450 the four cores. 128 00:05:48,450 --> 00:05:49,770 And then you're not done yet. 129 00:05:49,770 --> 00:05:52,620 And then you start with again core zero and assign 130 00:05:52,620 --> 00:05:53,980 chunks of size two. 131 00:05:53,980 --> 00:05:57,880 This is called block cyclic schedule. 132 00:05:57,880 --> 00:05:59,880 And if you do a chunk of size one, it's 133 00:05:59,880 --> 00:06:00,890 called a cyclic schedule. 134 00:06:00,890 --> 00:06:08,190 [UNINTELLIGIBLE] cycles just assigning iterations to cores. 135 00:06:08,190 --> 00:06:08,830 OK. 136 00:06:08,830 --> 00:06:11,700 So far so good? 137 00:06:11,700 --> 00:06:13,780 OK. 138 00:06:13,780 --> 00:06:15,780 So I want to do something. 139 00:06:15,780 --> 00:06:19,740 So I have this program. 140 00:06:19,740 --> 00:06:21,790 This is again your 141 00:06:21,790 --> 00:06:24,350 run-of-the-mill matrix multiply. 142 00:06:24,350 --> 00:06:30,070 And I ran a sequential single machine, and I got this 143 00:06:30,070 --> 00:06:31,860 performance. 144 00:06:31,860 --> 00:06:34,750 Then, I said look, I want to parallelize the outer loop. 145 00:06:34,750 --> 00:06:37,640 So I parallelize this loop. 146 00:06:37,640 --> 00:06:40,710 What should I get? 147 00:06:40,710 --> 00:06:41,790 [UNINTELLIGIBLE] fast or slow. 148 00:06:41,790 --> 00:06:46,970 I want to just check whether you are awake or sleep. 149 00:06:46,970 --> 00:06:47,890 How do [UNINTELLIGIBLE PHRASE] 150 00:06:47,890 --> 00:06:49,140 to run slower. 151 00:06:49,140 --> 00:06:53,030 152 00:06:53,030 --> 00:06:54,370 It's not a trick question. 153 00:06:54,370 --> 00:06:57,460 This is just to make sure that actually participate. 154 00:06:57,460 --> 00:06:59,230 How do people think it's run faster? 155 00:06:59,230 --> 00:07:00,950 AUDIENCE: [INAUDIBLE] 156 00:07:00,950 --> 00:07:01,100 PROFESSOR: OK. 157 00:07:01,100 --> 00:07:02,390 Good. 158 00:07:02,390 --> 00:07:05,234 What do others think? 159 00:07:05,234 --> 00:07:06,070 OK. 160 00:07:06,070 --> 00:07:08,980 They're probably checking their email or something. 161 00:07:08,980 --> 00:07:10,550 OK. 162 00:07:10,550 --> 00:07:14,340 So actually it ran faster. 163 00:07:14,340 --> 00:07:18,910 The source not run on the common cloud machines. 164 00:07:18,910 --> 00:07:20,990 This was a previous generation that I ran. 165 00:07:20,990 --> 00:07:23,390 So [UNINTELLIGIBLE] was seven times faster. 166 00:07:23,390 --> 00:07:24,290 So this was great. 167 00:07:24,290 --> 00:07:25,670 I parallized outer loop. 168 00:07:25,670 --> 00:07:28,390 What happens if I parallize inner loop? 169 00:07:28,390 --> 00:07:31,570 So this test, this i loop, runs parallel. 170 00:07:31,570 --> 00:07:34,060 Here, I launch the [UNINTELLIGIBLE] parallel. 171 00:07:34,060 --> 00:07:38,138 How much people thinks this runs faster? 172 00:07:38,138 --> 00:07:39,970 AUDIENCE: [INAUDIBLE] 173 00:07:39,970 --> 00:07:42,800 PROFESSOR: Compared to this one. 174 00:07:42,800 --> 00:07:45,290 How many people thinks this runs slower? 175 00:07:45,290 --> 00:07:46,530 OK. 176 00:07:46,530 --> 00:07:50,530 There's some consistent answers here. 177 00:07:50,530 --> 00:07:52,956 Why do you think it would run slower? 178 00:07:52,956 --> 00:07:56,130 179 00:07:56,130 --> 00:07:57,720 So OK. 180 00:07:57,720 --> 00:08:00,180 It ran slower, so it can improve that. 181 00:08:00,180 --> 00:08:02,270 And that's a little bit slow. 182 00:08:02,270 --> 00:08:05,774 Why is it slow? 183 00:08:05,774 --> 00:08:08,250 AUDIENCE: [INAUDIBLE] 184 00:08:08,250 --> 00:08:09,050 PROFESSOR: Exactly. 185 00:08:09,050 --> 00:08:12,520 So what it's doing here, it's basically spawning many, many 186 00:08:12,520 --> 00:08:13,200 times in here. 187 00:08:13,200 --> 00:08:17,530 Because every time you have parallelism, you chunkify into 188 00:08:17,530 --> 00:08:18,500 the processor. 189 00:08:18,500 --> 00:08:21,450 Here you are getting a lot more smaller chunks inside. 190 00:08:21,450 --> 00:08:24,700 So let's look at how this is basically run. 191 00:08:24,700 --> 00:08:31,075 So normally, you can think about an OpenMP program as you 192 00:08:31,075 --> 00:08:32,549 have one sequential thread. 193 00:08:32,549 --> 00:08:34,000 You run the main program. 194 00:08:34,000 --> 00:08:37,760 And then assume you have, in cores, you might have n minus 195 00:08:37,760 --> 00:08:42,100 1 other thread just waiting for work. 196 00:08:42,100 --> 00:08:45,270 And then, when you finally come to the parallel 197 00:08:45,270 --> 00:08:48,180 loop, it says, OK. 198 00:08:48,180 --> 00:08:48,730 Set up. 199 00:08:48,730 --> 00:08:52,080 What do you want to run on other basic cores. 200 00:08:52,080 --> 00:08:53,510 And release it. 201 00:08:53,510 --> 00:08:55,360 Release these waiting people. 202 00:08:55,360 --> 00:08:57,020 And let them start working on the parallel work. 203 00:08:57,020 --> 00:08:59,650 And also, I will start doing on my own chunk. 204 00:08:59,650 --> 00:09:03,720 So suddenly, when you say parallel four, it releases all 205 00:09:03,720 --> 00:09:07,310 other cores to go run that part of the core. 206 00:09:07,310 --> 00:09:11,740 And once it's done, it's will say, OK, I'm done. 207 00:09:11,740 --> 00:09:13,140 I have to wait until everybody is done. 208 00:09:13,140 --> 00:09:16,570 So even if the main guy is done, it has to wait until 209 00:09:16,570 --> 00:09:18,200 everybody is finished. 210 00:09:18,200 --> 00:09:22,610 And then, start executing the sequence [UNINTELLIGIBLE]. 211 00:09:22,610 --> 00:09:27,670 So this is the gist of how OpenMP program is run. 212 00:09:27,670 --> 00:09:30,840 And if you realize that it all heads here because you have 213 00:09:30,840 --> 00:09:33,150 basically make sure all these cases are broken up. 214 00:09:33,150 --> 00:09:35,740 So there's some things that has to be issued. 215 00:09:35,740 --> 00:09:38,590 And there's a delay between these guys can start if 216 00:09:38,590 --> 00:09:39,990 everybody has equal work. 217 00:09:39,990 --> 00:09:42,200 Despite not finishing on time because it may take some time 218 00:09:42,200 --> 00:09:43,430 for this to start. 219 00:09:43,430 --> 00:09:46,490 And then, it has to also tell this back OK, I am done. 220 00:09:46,490 --> 00:09:48,320 So there's a lot of synchronization going on. 221 00:09:48,320 --> 00:09:49,390 Locks and unlocks. 222 00:09:49,390 --> 00:09:51,020 Here it's called various synchronization here. 223 00:09:51,020 --> 00:09:54,085 224 00:09:54,085 --> 00:09:58,400 And so if this work is small, this synchronization starts 225 00:09:58,400 --> 00:10:00,070 dominating. 226 00:10:00,070 --> 00:10:03,530 So what happens is [UNINTELLIGIBLE] 227 00:10:03,530 --> 00:10:05,490 fine grain parallelism. 228 00:10:05,490 --> 00:10:08,090 Do a little work in the parallel region, and 229 00:10:08,090 --> 00:10:10,880 synchronization will basically start dominating your time. 230 00:10:10,880 --> 00:10:13,310 So how do you take this? 231 00:10:13,310 --> 00:10:15,880 And sometimes when you run something parallel, it might 232 00:10:15,880 --> 00:10:19,150 even run slow because the amount of stuff in the 233 00:10:19,150 --> 00:10:22,420 parallel region is so small, [UNINTELLIGIBLE] will start 234 00:10:22,420 --> 00:10:22,760 dominating. 235 00:10:22,760 --> 00:10:24,010 And that's not a good way. 236 00:10:24,010 --> 00:10:26,700 And also, sometimes you assume. 237 00:10:26,700 --> 00:10:29,240 And you keep increasing the number of cores. 238 00:10:29,240 --> 00:10:32,750 Hopefully, you want to see a nice parallelism increase, but 239 00:10:32,750 --> 00:10:35,090 it doesn't, even though you have enough information. 240 00:10:35,090 --> 00:10:37,880 But that means you're running a lot of small chunks, even 241 00:10:37,880 --> 00:10:39,390 though you seem to have a lot of parallelism available. 242 00:10:39,390 --> 00:10:42,730 243 00:10:42,730 --> 00:10:46,160 And also, you can make sure the synchronization in the 244 00:10:46,160 --> 00:10:47,210 time in the parallel region. 245 00:10:47,210 --> 00:10:48,720 If the parallel regions are on a very 246 00:10:48,720 --> 00:10:50,670 short time, this happens. 247 00:10:50,670 --> 00:10:54,780 We saw this effect when we were doing Cilk. 248 00:10:54,780 --> 00:10:56,030 Remember? 249 00:10:56,030 --> 00:10:59,450 When did we see this granularity affecting Cilk? 250 00:10:59,450 --> 00:11:00,750 And what did he do? 251 00:11:00,750 --> 00:11:03,900 252 00:11:03,900 --> 00:11:04,910 When you write Cilk programs. 253 00:11:04,910 --> 00:11:05,660 You write [UNINTELLIGIBLE] 254 00:11:05,660 --> 00:11:06,950 programs. 255 00:11:06,950 --> 00:11:12,500 Where did the granularity start showing up on us? 256 00:11:12,500 --> 00:11:15,160 It may not be exactly this because the scheduling is 257 00:11:15,160 --> 00:11:15,430 complicated. 258 00:11:15,430 --> 00:11:15,890 OK. 259 00:11:15,890 --> 00:11:16,380 Yes? 260 00:11:16,380 --> 00:11:19,280 AUDIENCE: The two by two matrix [INAUDIBLE] 261 00:11:19,280 --> 00:11:19,720 PROFESSOR: Yeah. 262 00:11:19,720 --> 00:11:20,420 Something like two by-- 263 00:11:20,420 --> 00:11:22,800 for example, that's the reason we wanted to have 264 00:11:22,800 --> 00:11:24,580 a large base case. 265 00:11:24,580 --> 00:11:26,900 Because if you didn't put a large base case, it keeps 266 00:11:26,900 --> 00:11:30,110 dividing into smaller and smaller problems. 267 00:11:30,110 --> 00:11:32,580 And if the schedule is smart, it won't be 268 00:11:32,580 --> 00:11:33,700 doing exactly this. 269 00:11:33,700 --> 00:11:37,350 But it's always good to have these large granulated chunks 270 00:11:37,350 --> 00:11:38,620 at the bottom. 271 00:11:38,620 --> 00:11:41,870 272 00:11:41,870 --> 00:11:43,890 So how to get [UNINTELLIGIBLE] 273 00:11:43,890 --> 00:11:45,560 granulated parallelism. 274 00:11:45,560 --> 00:11:48,080 What we need to do is reduce the number of [UNINTELLIGIBLE] 275 00:11:48,080 --> 00:11:48,770 equations. 276 00:11:48,770 --> 00:11:51,760 So you want to always try to look for the outer most loop 277 00:11:51,760 --> 00:11:55,080 you can get at all the really large independent regions. 278 00:11:55,080 --> 00:11:57,040 So you go look, and not [UNINTELLIGIBLE] 279 00:11:57,040 --> 00:11:57,930 thing you want to parallelize. 280 00:11:57,930 --> 00:12:01,165 You go up, up, up, up until the point you can parallelize. 281 00:12:01,165 --> 00:12:05,490 And that's the best way to get good performance. 282 00:12:05,490 --> 00:12:06,740 OK? 283 00:12:06,740 --> 00:12:09,580 284 00:12:09,580 --> 00:12:13,850 So if you really compare these three programs here, again, 285 00:12:13,850 --> 00:12:15,210 what you see-- 286 00:12:15,210 --> 00:12:16,720 of course, this has no synchronization. 287 00:12:16,720 --> 00:12:19,400 This has n amount of synchronizations. 288 00:12:19,400 --> 00:12:20,870 Here in [UNINTELLIGIBLE] synchronization, that's 289 00:12:20,870 --> 00:12:23,210 obviously a lot more synchronization going on. 290 00:12:23,210 --> 00:12:27,190 And that is where this [UNINTELLIGIBLE] comes from. 291 00:12:27,190 --> 00:12:27,620 OK. 292 00:12:27,620 --> 00:12:30,260 So now, I am switching a little bit in here. 293 00:12:30,260 --> 00:12:33,450 I want you guys to look at this program a little bit. 294 00:12:33,450 --> 00:12:35,000 So what am I doing here? 295 00:12:35,000 --> 00:12:37,580 I have two [UNINTELLIGIBLE]. 296 00:12:37,580 --> 00:12:45,110 And I am just basically adding matrix B to matrix A. OK? 297 00:12:45,110 --> 00:12:48,720 And then I have another loop test here, adding matrix C to 298 00:12:48,720 --> 00:12:52,120 matrix A. And what am I doing in here? 299 00:12:52,120 --> 00:12:55,740 I am basically going through matrix A in another 300 00:12:55,740 --> 00:12:59,020 direction in here. 301 00:12:59,020 --> 00:13:00,270 AUDIENCE: [INAUDIBLE] 302 00:13:00,270 --> 00:13:02,310 303 00:13:02,310 --> 00:13:03,700 PROFESSOR: It's not really a transpose. 304 00:13:03,700 --> 00:13:04,590 I'm not transposing. 305 00:13:04,590 --> 00:13:08,830 What I'm doing is I'm actually doing a mirror because the C 306 00:13:08,830 --> 00:13:10,210 gets mirrored on ix. 307 00:13:10,210 --> 00:13:11,030 It's because [UNINTELLIGIBLE] 308 00:13:11,030 --> 00:13:11,590 ix [UNINTELLIGIBLE] 309 00:13:11,590 --> 00:13:12,425 the other direction. 310 00:13:12,425 --> 00:13:12,740 OK? 311 00:13:12,740 --> 00:13:14,400 So it's not really a transpose. 312 00:13:14,400 --> 00:13:16,960 So I do a mirror addition. 313 00:13:16,960 --> 00:13:19,010 And then I'm asking for the two outer 314 00:13:19,010 --> 00:13:22,130 most loops to be parallel. 315 00:13:22,130 --> 00:13:25,440 So if you run this sequential-- 316 00:13:25,440 --> 00:13:29,870 OK, you get about 30 milliseconds, I 317 00:13:29,870 --> 00:13:32,120 guess, to run in here. 318 00:13:32,120 --> 00:13:33,700 So that is in [UNINTELLIGIBLE]. 319 00:13:33,700 --> 00:13:35,580 But if you're running parallel, what do you get? 320 00:13:35,580 --> 00:13:36,840 Should you get faster or slower? 321 00:13:36,840 --> 00:13:40,960 322 00:13:40,960 --> 00:13:42,310 OK. 323 00:13:42,310 --> 00:13:45,200 Anyone want to take a guess [UNINTELLIGIBLE] 324 00:13:45,200 --> 00:13:46,995 Sometimes some of these questions, you might not have 325 00:13:46,995 --> 00:13:48,040 enough information to answer. 326 00:13:48,040 --> 00:13:50,750 But it's still good to just take a stand on one direction 327 00:13:50,750 --> 00:13:51,450 or another. 328 00:13:51,450 --> 00:13:54,120 How many people think it runs faster? 329 00:13:54,120 --> 00:13:56,350 How many people think it runs slower? 330 00:13:56,350 --> 00:13:57,750 OK. 331 00:13:57,750 --> 00:13:59,550 Some. 332 00:13:59,550 --> 00:14:01,940 Oops. 333 00:14:01,940 --> 00:14:03,190 What happened? 334 00:14:03,190 --> 00:14:07,310 335 00:14:07,310 --> 00:14:09,140 What happened in here? 336 00:14:09,140 --> 00:14:19,750 337 00:14:19,750 --> 00:14:25,150 Can anybody point out why it might be running slower 338 00:14:25,150 --> 00:14:27,880 parallely than running sequentially? 339 00:14:27,880 --> 00:14:31,275 340 00:14:31,275 --> 00:14:33,220 AUDIENCE: [INAUDIBLE] 341 00:14:33,220 --> 00:14:33,550 PROFESSOR: Yeah. 342 00:14:33,550 --> 00:14:34,760 There's a cache issue. 343 00:14:34,760 --> 00:14:38,611 Watch the possible cache issue in here. 344 00:14:38,611 --> 00:14:41,600 AUDIENCE: [INAUDIBLE] 345 00:14:41,600 --> 00:14:42,850 PROFESSOR: Yeah. 346 00:14:42,850 --> 00:14:44,610 347 00:14:44,610 --> 00:14:50,610 If you think about, the first equations of, I guess, the 348 00:14:50,610 --> 00:14:51,080 first core-- 349 00:14:51,080 --> 00:14:52,230 I have some diagram. 350 00:14:52,230 --> 00:14:53,740 I'll show it to you in there. 351 00:14:53,740 --> 00:14:56,850 And here, only the last data elements we'll get for the 352 00:14:56,850 --> 00:14:58,760 first iterations because we are going 353 00:14:58,760 --> 00:14:59,970 in the other direction. 354 00:14:59,970 --> 00:15:03,500 So if you look at it a little more deeply into 355 00:15:03,500 --> 00:15:04,980 what's going on. 356 00:15:04,980 --> 00:15:08,710 Number of instructions seem to be a little higher. 357 00:15:08,710 --> 00:15:11,310 This one I couldn't actually explain why this might be the 358 00:15:11,310 --> 00:15:13,270 case in here. 359 00:15:13,270 --> 00:15:15,530 If anybody has an idea, you can say that. 360 00:15:15,530 --> 00:15:18,620 But this was kind of [UNINTELLIGIBLE]. 361 00:15:18,620 --> 00:15:21,500 This might be [UNINTELLIGIBLE] the cycles. 362 00:15:21,500 --> 00:15:22,750 Huh. 363 00:15:22,750 --> 00:15:27,830 364 00:15:27,830 --> 00:15:28,600 OK. 365 00:15:28,600 --> 00:15:30,610 I can explain this. 366 00:15:30,610 --> 00:15:34,730 Because this is a sequential run, this is a sum total of a 367 00:15:34,730 --> 00:15:36,380 parallel run. 368 00:15:36,380 --> 00:15:40,560 So because of all the overhead that happens because this was 369 00:15:40,560 --> 00:15:43,200 running on, I think, an eight core machine. 370 00:15:43,200 --> 00:15:45,630 So you're running eight times of small companies. 371 00:15:45,630 --> 00:15:48,420 There's a lot of overhead that goes around, synchronization, 372 00:15:48,420 --> 00:15:49,320 and stuff like that. 373 00:15:49,320 --> 00:15:52,050 So a number of instructions just blows up. 374 00:15:52,050 --> 00:15:55,998 But for each core, you don't have this blow up. 375 00:15:55,998 --> 00:15:57,248 AUDIENCE: [INAUDIBLE] 376 00:15:57,248 --> 00:15:59,484 377 00:15:59,484 --> 00:15:59,982 Cilk? 378 00:15:59,982 --> 00:16:03,468 Because does Cilk have different processor affinity, 379 00:16:03,468 --> 00:16:06,456 things that open [UNINTELLIGIBLE]? 380 00:16:06,456 --> 00:16:11,960 Because it seems like if the program, the language-- 381 00:16:11,960 --> 00:16:13,786 PROFESSOR: [INAUDIBLE]. 382 00:16:13,786 --> 00:16:16,610 Let's see if we can process the affinity 383 00:16:16,610 --> 00:16:18,670 information or if not. 384 00:16:18,670 --> 00:16:21,330 It's just pure [UNINTELLIGIBLE]. 385 00:16:21,330 --> 00:16:23,780 AUDIENCE: [INAUDIBLE] 386 00:16:23,780 --> 00:16:24,600 PROFESSOR: Yeah. 387 00:16:24,600 --> 00:16:28,430 I mean if you like executed locally if you have good cache 388 00:16:28,430 --> 00:16:28,900 [UNINTELLIGIBLE] 389 00:16:28,900 --> 00:16:29,210 with them. 390 00:16:29,210 --> 00:16:31,060 But if there's no cache [UNINTELLIGIBLE] 391 00:16:31,060 --> 00:16:32,670 you might steal something where data might 392 00:16:32,670 --> 00:16:33,887 be somewhere else. 393 00:16:33,887 --> 00:16:37,296 AUDIENCE: But you'll still mimic the cache behavior, 394 00:16:37,296 --> 00:16:42,166 considerably, except for when you steal. 395 00:16:42,166 --> 00:16:44,601 [INAUDIBLE] 396 00:16:44,601 --> 00:16:45,088 PROFESSOR: Yeah. 397 00:16:45,088 --> 00:16:46,070 So OK. 398 00:16:46,070 --> 00:16:49,450 We don't have a mic in here. 399 00:16:49,450 --> 00:16:49,640 OK. 400 00:16:49,640 --> 00:16:50,570 There's a mic. 401 00:16:50,570 --> 00:16:51,820 There we go. 402 00:16:51,820 --> 00:16:54,380 403 00:16:54,380 --> 00:16:57,110 But if you have two different of these regions, the way the 404 00:16:57,110 --> 00:17:01,985 parallelization happens can be different. 405 00:17:01,985 --> 00:17:06,440 AUDIENCE: In Cilk, what happens is the code is 406 00:17:06,440 --> 00:17:10,569 mimicking, for the most part, exactly what the C or C++ code 407 00:17:10,569 --> 00:17:11,930 would be doing. 408 00:17:11,930 --> 00:17:15,339 And so you get exactly the same cache hits and misses. 409 00:17:15,339 --> 00:17:19,690 Except when you steal, it's like starting over with an 410 00:17:19,690 --> 00:17:21,550 empty cache. 411 00:17:21,550 --> 00:17:21,770 OK? 412 00:17:21,770 --> 00:17:24,589 But as long as you have sufficient parallelism, the 413 00:17:24,589 --> 00:17:29,080 steals don't occur very often. 414 00:17:29,080 --> 00:17:31,520 And so therefore, you end up getting the same kind of 415 00:17:31,520 --> 00:17:33,740 behavior that you would get out of the serial code. 416 00:17:33,740 --> 00:17:34,120 PROFESSOR: Yeah. 417 00:17:34,120 --> 00:17:37,250 But Charles, in this one, because you had to steal 418 00:17:37,250 --> 00:17:41,240 everything here, the between here and here, the parallelism 419 00:17:41,240 --> 00:17:41,830 would be different. 420 00:17:41,830 --> 00:17:43,220 AUDIENCE: There would be no affinity between those two. 421 00:17:43,220 --> 00:17:44,380 PROFESSOR: No [? affinity ?] will be there. 422 00:17:44,380 --> 00:17:45,300 Yeah, exactly. 423 00:17:45,300 --> 00:17:48,180 So in the sequential one, everything that fits in the 424 00:17:48,180 --> 00:17:50,720 cache, so that would be affinity because we are not 425 00:17:50,720 --> 00:17:51,790 doing parallelism. 426 00:17:51,790 --> 00:17:53,510 And that's what I think happened here. 427 00:17:53,510 --> 00:17:54,990 Because you had no [? affinity-- ?] 428 00:17:54,990 --> 00:17:57,960 AUDIENCE: No in a serial code, there's no [? affinity. ?] 429 00:17:57,960 --> 00:17:58,070 PROFESSOR: No. 430 00:17:58,070 --> 00:18:00,920 Serial code-- if this fits in the cache, it's then running 431 00:18:00,920 --> 00:18:03,090 on one core. 432 00:18:03,090 --> 00:18:07,900 So if it fits in the one core's cache, you're happy. 433 00:18:07,900 --> 00:18:10,770 AUDIENCE: So the issue is-- right-- is if you only access 434 00:18:10,770 --> 00:18:14,860 it once, by the time you fill up the cache-- 435 00:18:14,860 --> 00:18:16,600 It takes some time to fill up the cache to get them 436 00:18:16,600 --> 00:18:17,280 synchronized. 437 00:18:17,280 --> 00:18:19,050 PROFESSOR: So it fits in the one core's cache, it's OK. 438 00:18:19,050 --> 00:18:21,210 Otherwise, it has no affinity in here. 439 00:18:21,210 --> 00:18:26,290 So the key difference in here is, of course, CPI is slow. 440 00:18:26,290 --> 00:18:27,530 We don't know exactly why. 441 00:18:27,530 --> 00:18:28,910 But in [UNINTELLIGIBLE]. 442 00:18:28,910 --> 00:18:31,480 So what you find is that there's a huge amount of cache 443 00:18:31,480 --> 00:18:32,150 in [UNINTELLIGIBLE] 444 00:18:32,150 --> 00:18:33,660 going on. 445 00:18:33,660 --> 00:18:35,660 So that should give you a feeling of what's going on. 446 00:18:35,660 --> 00:18:37,370 So let's look at what might happen. 447 00:18:37,370 --> 00:18:42,070 So I'm showing this matches [UNINTELLIGIBLE] 448 00:18:42,070 --> 00:18:46,950 last year on-- what we had were Cagnode machines that 449 00:18:46,950 --> 00:18:49,130 were basically code to quad processor. 450 00:18:49,130 --> 00:18:50,620 So we had eight codes in here. 451 00:18:50,620 --> 00:18:52,956 And I put them-- you don't have to now 452 00:18:52,956 --> 00:18:53,680 look at this table. 453 00:18:53,680 --> 00:18:57,210 I put them in the slides so you can look at it later. 454 00:18:57,210 --> 00:18:58,920 And so this is the last year's machine. 455 00:18:58,920 --> 00:19:01,360 And of course, this year's machine is different. 456 00:19:01,360 --> 00:19:04,610 We have two six core processors in here. 457 00:19:04,610 --> 00:19:06,330 So this is what we [UNINTELLIGIBLE] 458 00:19:06,330 --> 00:19:07,650 this year. 459 00:19:07,650 --> 00:19:09,380 [UNINTELLIGIBLE] 460 00:19:09,380 --> 00:19:10,390 OK. 461 00:19:10,390 --> 00:19:13,670 And so right now, I'm showing numbers for this one. 462 00:19:13,670 --> 00:19:15,830 And later, I will show what happened in the 463 00:19:15,830 --> 00:19:16,770 [UNINTELLIGIBLE]. 464 00:19:16,770 --> 00:19:19,360 So if you look at a cache-- 465 00:19:19,360 --> 00:19:25,390 so what happened is each of the data items in the cache 466 00:19:25,390 --> 00:19:27,340 can be in multiple states. 467 00:19:27,340 --> 00:19:29,600 This is called MSI protocol here. 468 00:19:29,600 --> 00:19:33,120 What that means is the item might be modified. 469 00:19:33,120 --> 00:19:36,490 If it is modified, it can be only in one cache. 470 00:19:36,490 --> 00:19:40,220 If anybody else wants to touch it, it has to get it out of 471 00:19:40,220 --> 00:19:42,340 the modified state. 472 00:19:42,340 --> 00:19:43,670 Or it can be sharing. 473 00:19:43,670 --> 00:19:46,190 Sharing means it's reading. 474 00:19:46,190 --> 00:19:49,780 So that means that item is read by multiple people. 475 00:19:49,780 --> 00:19:51,160 And that can have multiple covers. 476 00:19:51,160 --> 00:19:53,375 So sharing items can be in multiple places. 477 00:19:53,375 --> 00:19:55,890 478 00:19:55,890 --> 00:19:59,220 However if you're modifying, [UNINTELLIGIBLE] 479 00:19:59,220 --> 00:20:00,890 items in everybody else. 480 00:20:00,890 --> 00:20:03,130 So that means if I modify something, it 481 00:20:03,130 --> 00:20:04,230 can only be in mine. 482 00:20:04,230 --> 00:20:06,540 If other people had that data, I had to go in 483 00:20:06,540 --> 00:20:07,070 and validate this. 484 00:20:07,070 --> 00:20:09,470 So if you modify this, I had to go in and validate it. 485 00:20:09,470 --> 00:20:10,740 So that's a sharing state. 486 00:20:10,740 --> 00:20:11,910 That means I'm [UNINTELLIGIBLE] everybody 487 00:20:11,910 --> 00:20:13,130 [UNINTELLIGIBLE] 488 00:20:13,130 --> 00:20:13,880 read this. 489 00:20:13,880 --> 00:20:16,290 But if I ever want to change that, I have to go in and 490 00:20:16,290 --> 00:20:17,780 validate this one. 491 00:20:17,780 --> 00:20:20,530 So what that means is when I start writing, I am validating 492 00:20:20,530 --> 00:20:21,750 it from everybody. 493 00:20:21,750 --> 00:20:25,980 So even if everybody kept a copy and they start modifying, 494 00:20:25,980 --> 00:20:28,540 I had to get my own copy. 495 00:20:28,540 --> 00:20:30,000 And everybody else will invalidate. 496 00:20:30,000 --> 00:20:32,260 And then, if somebody else wanted to read-- 497 00:20:32,260 --> 00:20:34,410 for example, if this guy wants to read-- 498 00:20:34,410 --> 00:20:35,960 basically, this has to make this a 499 00:20:35,960 --> 00:20:37,950 sharing and back to sharing. 500 00:20:37,950 --> 00:20:40,080 That means I have to get the value 13, propogate it, and 501 00:20:40,080 --> 00:20:43,100 this becomes sharing again. 502 00:20:43,100 --> 00:20:43,560 OK. 503 00:20:43,560 --> 00:20:44,600 Did you get that? 504 00:20:44,600 --> 00:20:45,770 What's going on in here? 505 00:20:45,770 --> 00:20:47,520 In the cache? 506 00:20:47,520 --> 00:20:51,050 So reads, everybody can keep a copy if they want. 507 00:20:51,050 --> 00:20:53,530 Write-- only one guy can keep a copy. 508 00:20:53,530 --> 00:20:55,730 So what happens then is true sharing. 509 00:20:55,730 --> 00:20:58,280 So you have these two different cores. 510 00:20:58,280 --> 00:20:59,550 So I want to read something. 511 00:20:59,550 --> 00:21:02,500 So I get it from outside probably on main memory, and I 512 00:21:02,500 --> 00:21:04,820 put it in my cache in here. 513 00:21:04,820 --> 00:21:09,790 And then, the next guy wants to write the same thing. 514 00:21:09,790 --> 00:21:11,200 Assume I'm writing that. 515 00:21:11,200 --> 00:21:14,520 And once I want to write that, I can keep this copy I had 516 00:21:14,520 --> 00:21:16,920 invalidated from here and get a copy here. 517 00:21:16,920 --> 00:21:19,130 And then, if this way, I want to write it again, I have to 518 00:21:19,130 --> 00:21:21,140 basically invalidate it from here and get a copy. 519 00:21:21,140 --> 00:21:23,400 If I'm reading, both of us can keep a copy and just kind of 520 00:21:23,400 --> 00:21:25,570 keep bouncing back and forth, back and forth. 521 00:21:25,570 --> 00:21:30,350 And so if you bounce too many times, you get all of these 522 00:21:30,350 --> 00:21:31,520 invalidations. 523 00:21:31,520 --> 00:21:34,770 So the fact I looked at that I have invalidations basically 524 00:21:34,770 --> 00:21:37,770 tells me something like this is going on. 525 00:21:37,770 --> 00:21:39,505 So what's happening in this program? 526 00:21:39,505 --> 00:21:42,070 527 00:21:42,070 --> 00:21:46,420 When I parallelize this four loop, my four cores-- 528 00:21:46,420 --> 00:21:49,080 basically since I am doing here [UNINTELLIGIBLE]-- 529 00:21:49,080 --> 00:21:51,710 are going to get this nice distribution of 530 00:21:51,710 --> 00:21:53,690 data into the caches. 531 00:21:53,690 --> 00:21:55,900 Assume it fits in cache. 532 00:21:55,900 --> 00:21:56,140 OK. 533 00:21:56,140 --> 00:21:59,050 So all this data nicely fits into cache, and now I'm pretty 534 00:21:59,050 --> 00:22:00,760 happy when I run this one because I got this 535 00:22:00,760 --> 00:22:01,490 data into the cache. 536 00:22:01,490 --> 00:22:02,910 And I write it. 537 00:22:02,910 --> 00:22:07,860 But the minute I go in here, basically this data has to 538 00:22:07,860 --> 00:22:09,110 [UNINTELLIGIBLE]. 539 00:22:09,110 --> 00:22:10,710 540 00:22:10,710 --> 00:22:14,690 OK, because I am going this is n minus i in here so this data 541 00:22:14,690 --> 00:22:15,520 has a flip route. 542 00:22:15,520 --> 00:22:19,380 And by doing this, basically, I incur this huge amount of 543 00:22:19,380 --> 00:22:22,090 [UNINTELLIGIBLE], and it slows down. 544 00:22:22,090 --> 00:22:22,670 OK? 545 00:22:22,670 --> 00:22:25,250 So that's why it didn't work well. 546 00:22:25,250 --> 00:22:28,990 So what can you do? 547 00:22:28,990 --> 00:22:33,580 548 00:22:33,580 --> 00:22:38,230 When you have these read, write and write, write 549 00:22:38,230 --> 00:22:40,050 conflicts in here. 550 00:22:40,050 --> 00:22:44,420 And you have to actually move the data across in here. 551 00:22:44,420 --> 00:22:48,450 And what you can do is look for this true sharing. 552 00:22:48,450 --> 00:22:49,550 You can look at the [UNINTELLIGIBLE] 553 00:22:49,550 --> 00:22:51,030 and see if we have excessive 554 00:22:51,030 --> 00:22:52,910 [UNINTELLIGIBLE], we have a problem. 555 00:22:52,910 --> 00:22:54,990 And how do we eliminate that? 556 00:22:54,990 --> 00:22:56,240 You want to make the sharing minimal. 557 00:22:56,240 --> 00:22:59,580 558 00:22:59,580 --> 00:23:01,984 If you want to get some data into a cache, you want to try 559 00:23:01,984 --> 00:23:04,320 to keep it there as much as possible. 560 00:23:04,320 --> 00:23:06,050 And if you're using, you'd want to try to align 561 00:23:06,050 --> 00:23:08,080 everything across. 562 00:23:08,080 --> 00:23:11,140 So even across different regions, it'll use the same 563 00:23:11,140 --> 00:23:12,950 kind of things. 564 00:23:12,950 --> 00:23:15,120 And/or enforce some kind of [UNINTELLIGIBLE] technique to 565 00:23:15,120 --> 00:23:15,900 keep the data alive. 566 00:23:15,900 --> 00:23:18,045 So there are a lot of techniques in there, but true 567 00:23:18,045 --> 00:23:19,900 sharing can be an interesting problem. 568 00:23:19,900 --> 00:23:23,090 So in here, simple change, yes. 569 00:23:23,090 --> 00:23:28,380 You're, basically, instead of changing A, you change C. So 570 00:23:28,380 --> 00:23:30,130 you write A the same way. 571 00:23:30,130 --> 00:23:33,580 But now what I have done is I am doing the mirror by 572 00:23:33,580 --> 00:23:36,760 changing the axis to C, is to [UNINTELLIGIBLE] is the same 573 00:23:36,760 --> 00:23:38,590 as this axis. 574 00:23:38,590 --> 00:23:40,070 So these two are the same thing. 575 00:23:40,070 --> 00:23:42,620 And the minute I do that, voila! 576 00:23:42,620 --> 00:23:44,210 I get good speed up. 577 00:23:44,210 --> 00:23:47,290 Because if you look at that, my inundations has gone down. 578 00:23:47,290 --> 00:23:48,640 My L1 cache [UNINTELLIGIBLE] has now 579 00:23:48,640 --> 00:23:49,610 really, really gone down. 580 00:23:49,610 --> 00:23:51,030 I'm not doing anything. 581 00:23:51,030 --> 00:23:54,820 And of course, I am doing more instructions here than this 582 00:23:54,820 --> 00:23:57,220 one because-- 583 00:23:57,220 --> 00:24:00,070 I think, the difference between instruction here and 584 00:24:00,070 --> 00:24:03,220 here is because a lot of times synchronization operations are 585 00:24:03,220 --> 00:24:05,530 dynamic because in the [UNINTELLIGIBLE] 586 00:24:05,530 --> 00:24:07,760 miscounted as the instructions, you are busy 587 00:24:07,760 --> 00:24:09,770 waiting in there. 588 00:24:09,770 --> 00:24:13,350 So this number is not really a constant number. 589 00:24:13,350 --> 00:24:14,850 OK, question. 590 00:24:14,850 --> 00:24:15,440 AUDIENCE: Not a question. 591 00:24:15,440 --> 00:24:19,020 So another thing one could do here is do loop fusion. 592 00:24:19,020 --> 00:24:19,530 PROFESSOR: Yes. 593 00:24:19,530 --> 00:24:20,410 Yes. 594 00:24:20,410 --> 00:24:24,100 Here is a nice way of putting both of the loops into one and 595 00:24:24,100 --> 00:24:24,990 do loop fusion. 596 00:24:24,990 --> 00:24:26,300 And that works. 597 00:24:26,300 --> 00:24:28,230 In this case, you can do that. 598 00:24:28,230 --> 00:24:31,360 AUDIENCE: So loop fusion is where you take two loops, and 599 00:24:31,360 --> 00:24:33,620 you convert it into one loop. 600 00:24:33,620 --> 00:24:37,050 So in this case, you could have just written one nest, 601 00:24:37,050 --> 00:24:38,840 which has two things going on inside. 602 00:24:38,840 --> 00:24:41,490 603 00:24:41,490 --> 00:24:44,580 And then you would save all the loop overhead and the 604 00:24:44,580 --> 00:24:46,090 scheduling overhead. 605 00:24:46,090 --> 00:24:49,090 So rather than doing it twice, you actually have reduced the 606 00:24:49,090 --> 00:24:52,890 overhead to just the parallelism of the one loop. 607 00:24:52,890 --> 00:24:55,790 So if you look at that, you'll realize that you could somehow 608 00:24:55,790 --> 00:24:59,670 make it just be a single nest with two statements in there, 609 00:24:59,670 --> 00:25:01,130 rather than one. 610 00:25:01,130 --> 00:25:04,100 PROFESSOR: So basically, instead of [UNINTELLIGIBLE] 611 00:25:04,100 --> 00:25:08,190 entire thing and move plus C into here, basically. 612 00:25:08,190 --> 00:25:10,040 And I could have just done it in one loop nest. 613 00:25:10,040 --> 00:25:11,950 That's what loop fusion would do here. 614 00:25:11,950 --> 00:25:13,750 So we can actually [UNINTELLIGIBLE] 615 00:25:13,750 --> 00:25:15,280 much nicer in here. 616 00:25:15,280 --> 00:25:19,190 But just for example purposes, so now I really reduced this 617 00:25:19,190 --> 00:25:20,540 one and got that. 618 00:25:20,540 --> 00:25:21,740 So this is great. 619 00:25:21,740 --> 00:25:23,830 Cagnodes really showed this classic 620 00:25:23,830 --> 00:25:25,820 problem in the computer. 621 00:25:25,820 --> 00:25:27,070 And so I'm like, OK. 622 00:25:27,070 --> 00:25:28,720 Now we have new machines. 623 00:25:28,720 --> 00:25:30,075 Let's try it and see what happens. 624 00:25:30,075 --> 00:25:34,090 625 00:25:34,090 --> 00:25:35,670 What does this show? 626 00:25:35,670 --> 00:25:41,240 This is your nice cloud machines we've got now. 627 00:25:41,240 --> 00:25:42,480 I have no slow down. 628 00:25:42,480 --> 00:25:47,570 I was really disappointed because beforehand, I had this 629 00:25:47,570 --> 00:25:50,070 for sharing going on in here and had a 630 00:25:50,070 --> 00:25:51,100 really big slow down. 631 00:25:51,100 --> 00:25:54,070 But this one, in fact, the difference is very small. 632 00:25:54,070 --> 00:25:56,540 And when you look at any kind of performance counters, they 633 00:25:56,540 --> 00:25:58,580 are pretty comparable. 634 00:25:58,580 --> 00:26:00,220 There's nothing much going on here. 635 00:26:00,220 --> 00:26:03,580 So what do you think is going on in this new 636 00:26:03,580 --> 00:26:06,130 architecture now? 637 00:26:06,130 --> 00:26:07,400 Why this might be? 638 00:26:07,400 --> 00:26:15,730 639 00:26:15,730 --> 00:26:16,980 AUDIENCE: [INAUDIBLE] 640 00:26:16,980 --> 00:26:26,030 641 00:26:26,030 --> 00:26:28,800 PROFESSOR: That's an interesting observation, but 642 00:26:28,800 --> 00:26:32,050 also we have-- 643 00:26:32,050 --> 00:26:36,250 yes, core seven basically is two by two in the die. 644 00:26:36,250 --> 00:26:38,650 But we also have two different processors. 645 00:26:38,650 --> 00:26:40,110 So that's there, too. 646 00:26:40,110 --> 00:26:43,640 So in some sense, when you get our two-way process 647 00:26:43,640 --> 00:26:44,670 [UNINTELLIGIBLE] 648 00:26:44,670 --> 00:26:45,620 So that's there. 649 00:26:45,620 --> 00:26:46,370 That might help. 650 00:26:46,370 --> 00:26:48,650 That's an interesting observation. 651 00:26:48,650 --> 00:26:50,340 What else might be going on here? 652 00:26:50,340 --> 00:26:52,880 Why do you think they manage to get this one? 653 00:26:52,880 --> 00:26:54,580 What might be another answer? 654 00:26:54,580 --> 00:26:58,790 655 00:26:58,790 --> 00:27:03,350 What can hide these kind of delays that can happen? 656 00:27:03,350 --> 00:27:06,990 Load delays, and cache misses, and stuff like that. 657 00:27:06,990 --> 00:27:09,110 What techniques and hardware can hide those? 658 00:27:09,110 --> 00:27:15,300 659 00:27:15,300 --> 00:27:17,050 Just [UNINTELLIGIBLE] a speculation. 660 00:27:17,050 --> 00:27:18,610 Prefetching. 661 00:27:18,610 --> 00:27:22,290 So most hardware has an internal mechanism. 662 00:27:22,290 --> 00:27:24,890 When you start fetching data, you say, aha! 663 00:27:24,890 --> 00:27:26,420 I see a pattern. 664 00:27:26,420 --> 00:27:27,650 I know you want to get this thing. 665 00:27:27,650 --> 00:27:31,960 Let me go forward and bring more data, thinking you are 666 00:27:31,960 --> 00:27:34,240 going to follow that pattern. 667 00:27:34,240 --> 00:27:34,590 OK. 668 00:27:34,590 --> 00:27:36,240 All or most of the [UNINTELLIGIBLE] 669 00:27:36,240 --> 00:27:39,010 for, I think, have [UNINTELLIGIBLE] even a 670 00:27:39,010 --> 00:27:42,280 Pentium had something for prefetching going on. 671 00:27:42,280 --> 00:27:44,470 But most of the time, what happens is the prefetching 672 00:27:44,470 --> 00:27:45,970 engine can't keep up. 673 00:27:45,970 --> 00:27:47,930 If you are getting there, it's [UNINTELLIGIBLE] 674 00:27:47,930 --> 00:27:48,880 a little bit further. 675 00:27:48,880 --> 00:27:50,650 You are going to catch up, and you're going to start because 676 00:27:50,650 --> 00:27:53,026 you have more [UNINTELLIGIBLE] here. 677 00:27:53,026 --> 00:27:54,980 I think [UNINTELLIGIBLE] 678 00:27:54,980 --> 00:27:56,880 has a really, really good prefetcher. 679 00:27:56,880 --> 00:28:02,170 And then, we saw it in our architecture slides, too. 680 00:28:02,170 --> 00:28:04,940 That a lot of things that used to happen before is gone. 681 00:28:04,940 --> 00:28:05,990 So this is really good. 682 00:28:05,990 --> 00:28:11,330 What that means is a lot of weird stuff that's going on 683 00:28:11,330 --> 00:28:12,360 [UNINTELLIGIBLE] 684 00:28:12,360 --> 00:28:14,010 making them disappear. 685 00:28:14,010 --> 00:28:15,990 So these kind of problems don't show up. 686 00:28:15,990 --> 00:28:17,260 So that's the nice story. 687 00:28:17,260 --> 00:28:18,920 The other part is, OK. 688 00:28:18,920 --> 00:28:22,630 Now if you start really tweaking your programs to one 689 00:28:22,630 --> 00:28:25,120 architecture, you wait a generation. 690 00:28:25,120 --> 00:28:28,300 And then now, we have done either the tweaking-- 691 00:28:28,300 --> 00:28:33,910 the best case, tweaking has no impact, and it's 692 00:28:33,910 --> 00:28:35,170 not affecting anything. 693 00:28:35,170 --> 00:28:37,235 In most of the time, worst case, tweaking actually slows 694 00:28:37,235 --> 00:28:39,160 down the program because you are trying to do something 695 00:28:39,160 --> 00:28:40,030 complicated. 696 00:28:40,030 --> 00:28:42,790 That's just not needed anymore. 697 00:28:42,790 --> 00:28:46,990 So even though these kind of things showed up in 698 00:28:46,990 --> 00:28:47,500 [UNINTELLIGIBLE] 699 00:28:47,500 --> 00:28:48,750 architecture, it's not an issue. 700 00:28:48,750 --> 00:28:52,250 But if you go to many of the smaller architectures that 701 00:28:52,250 --> 00:28:55,830 have that don't have that much of the very popular 702 00:28:55,830 --> 00:28:57,620 prefetchers, this kind of issue you would see. 703 00:28:57,620 --> 00:29:00,610 So for example, if you go to a cell phone [UNINTELLIGIBLE], 704 00:29:00,610 --> 00:29:03,770 you would probably see these kind of issues happening. 705 00:29:03,770 --> 00:29:05,560 Any questions here so far? 706 00:29:05,560 --> 00:29:06,300 So that's the good news. 707 00:29:06,300 --> 00:29:08,280 You guys don't have to worry about it too much. 708 00:29:08,280 --> 00:29:11,440 But at least it's good to know the technique because you'll 709 00:29:11,440 --> 00:29:13,820 see it in other architectures. 710 00:29:13,820 --> 00:29:18,110 So now, I want to switch a little bit into looking at 711 00:29:18,110 --> 00:29:21,630 programs that don't have what we call data parallelism. 712 00:29:21,630 --> 00:29:24,320 That means you can start and say, [UNINTELLIGIBLE] 713 00:29:24,320 --> 00:29:24,740 parallels. 714 00:29:24,740 --> 00:29:26,840 Everybody get the different chunk and run. 715 00:29:26,840 --> 00:29:30,110 And we are going a little bit more deeply into looking at 716 00:29:30,110 --> 00:29:32,230 programs that are a little bit different. 717 00:29:32,230 --> 00:29:36,540 So I wanted to come up with this little representation to 718 00:29:36,540 --> 00:29:37,580 represent the program. 719 00:29:37,580 --> 00:29:42,090 And so if you think about iteration space-- 720 00:29:42,090 --> 00:29:44,090 actually before you, I'll go down to dependence. 721 00:29:44,090 --> 00:29:46,200 I'll also do a little bit of load balance. 722 00:29:46,200 --> 00:29:49,890 So here's a loop that in my iterations-- 723 00:29:49,890 --> 00:29:53,690 the first one I transformed zero to eight. 724 00:29:53,690 --> 00:29:55,670 But J only runs from one to eight. 725 00:29:55,670 --> 00:29:58,630 So each I, I have less and less amount of 726 00:29:58,630 --> 00:30:02,160 J iterations, basically. 727 00:30:02,160 --> 00:30:02,600 OK? 728 00:30:02,600 --> 00:30:05,100 It's a triangular loop. 729 00:30:05,100 --> 00:30:06,350 OK? 730 00:30:06,350 --> 00:30:10,050 731 00:30:10,050 --> 00:30:10,250 OK. 732 00:30:10,250 --> 00:30:12,420 So this is the way to represent iteration space, so 733 00:30:12,420 --> 00:30:15,600 I will represent data and get back to this again. 734 00:30:15,600 --> 00:30:18,830 So if you look at a data space, you can assume data 735 00:30:18,830 --> 00:30:23,740 iteration space could be this funky, triangular, hyperplane 736 00:30:23,740 --> 00:30:24,760 type of thing. 737 00:30:24,760 --> 00:30:28,990 Whereas data is mostly [? rectangulineum ?], 738 00:30:28,990 --> 00:30:30,740 multi-dimensional rectangle. 739 00:30:30,740 --> 00:30:32,810 So for example, if I have [UNINTELLIGIBLE] 740 00:30:32,810 --> 00:30:35,140 and it's a one-dimensional one, this is basically a 741 00:30:35,140 --> 00:30:36,050 two-dimensional data. 742 00:30:36,050 --> 00:30:37,480 And you can have three-dimensional cubes and 743 00:30:37,480 --> 00:30:38,020 stuff like that. 744 00:30:38,020 --> 00:30:39,500 You can represent data like that. 745 00:30:39,500 --> 00:30:41,170 So this is a way to nicely represent. 746 00:30:41,170 --> 00:30:45,000 And when you start thinking about it, we can look at 747 00:30:45,000 --> 00:30:45,850 what's going on. 748 00:30:45,850 --> 00:30:46,780 OK? 749 00:30:46,780 --> 00:30:49,870 So now you have this loop again. 750 00:30:49,870 --> 00:30:52,820 So here's the basic [UNINTELLIGIBLE] iterations. 751 00:30:52,820 --> 00:30:54,140 And here's the data. 752 00:30:54,140 --> 00:30:56,850 Assume this is both A and B. There will be another one for 753 00:30:56,850 --> 00:30:57,480 matrix [UNINTELLIGIBLE] 754 00:30:57,480 --> 00:30:59,870 B. One data into each iteration is going to touch. 755 00:30:59,870 --> 00:31:01,700 So these are the data that need to get touched, and 756 00:31:01,700 --> 00:31:04,270 here's the iterations you are going to run. 757 00:31:04,270 --> 00:31:09,720 So we can say OpenMP parallel four. 758 00:31:09,720 --> 00:31:12,200 So what happens when you do parallel four? 759 00:31:12,200 --> 00:31:14,630 So I am going to get parallel. 760 00:31:14,630 --> 00:31:17,930 And so core one gets this one, another core, another core, 761 00:31:17,930 --> 00:31:20,610 another core get these iterations running. 762 00:31:20,610 --> 00:31:22,398 So what happens if you do this one? 763 00:31:22,398 --> 00:31:26,520 764 00:31:26,520 --> 00:31:29,590 Do you get really good performance? 765 00:31:29,590 --> 00:31:31,852 Why? 766 00:31:31,852 --> 00:31:34,570 AUDIENCE: [INAUDIBLE] 767 00:31:34,570 --> 00:31:35,080 PROFESSOR: It's not balanced. 768 00:31:35,080 --> 00:31:36,900 The load is not balanced in here. 769 00:31:36,900 --> 00:31:40,150 So basically if you run sequential and if you run 770 00:31:40,150 --> 00:31:45,990 block distribution, I get about 3x performance in here. 771 00:31:45,990 --> 00:31:49,470 So if I look at closely, here is the number of iterations 772 00:31:49,470 --> 00:31:51,340 given to each core. 773 00:31:51,340 --> 00:31:53,790 The first core gets almost nothing, and the last guy gets 774 00:31:53,790 --> 00:31:55,380 a lot of work. 775 00:31:55,380 --> 00:31:58,510 Here's where something like the Cilk runtime can come into 776 00:31:58,510 --> 00:32:01,955 play because with Cilk runtime, basically, this guy 777 00:32:01,955 --> 00:32:03,040 will finish the [UNINTELLIGIBLE] 778 00:32:03,040 --> 00:32:04,620 start stealing from somebody else. 779 00:32:04,620 --> 00:32:07,620 And so it would be done nicely. 780 00:32:07,620 --> 00:32:09,860 But whereas if you do a static schedule, you 781 00:32:09,860 --> 00:32:11,190 are in this big bind. 782 00:32:11,190 --> 00:32:13,270 You don't have too many things going on. 783 00:32:13,270 --> 00:32:17,990 784 00:32:17,990 --> 00:32:20,650 And basically, this is what we call load imbalance. 785 00:32:20,650 --> 00:32:24,760 So what you can do is figure out a complicated partitioning 786 00:32:24,760 --> 00:32:27,645 so you can statically partition this out. 787 00:32:27,645 --> 00:32:31,810 Or you can do something like the dynamic scheduler like the 788 00:32:31,810 --> 00:32:32,530 [UNINTELLIGIBLE] 789 00:32:32,530 --> 00:32:35,720 scheduler for a solution. 790 00:32:35,720 --> 00:32:36,970 So how to detect load imbalance? 791 00:32:36,970 --> 00:32:39,290 792 00:32:39,290 --> 00:32:44,120 Basically, what you might want to do is for each of the 793 00:32:44,120 --> 00:32:45,690 different sections you are running, you want to look at 794 00:32:45,690 --> 00:32:47,520 the time mistakes. 795 00:32:47,520 --> 00:32:50,530 And in the [UNINTELLIGIBLE] axis varying, huge varying, 796 00:32:50,530 --> 00:32:52,260 that means there's a load imbalance going on. 797 00:32:52,260 --> 00:32:55,345 So you might want to check and make sure each of the parallel 798 00:32:55,345 --> 00:32:58,330 regions time is taking. 799 00:32:58,330 --> 00:33:01,370 And that gives you this view. 800 00:33:01,370 --> 00:33:04,620 How to eliminate load imbalance or the use of 801 00:33:04,620 --> 00:33:08,500 dynamic scheduler that will deal with that. 802 00:33:08,500 --> 00:33:12,220 Or you can do a different distribution statically. 803 00:33:12,220 --> 00:33:14,690 That will not partition in this large block. 804 00:33:14,690 --> 00:33:16,900 So let me show you a static part because we have already 805 00:33:16,900 --> 00:33:18,440 learned the dynamic part before. 806 00:33:18,440 --> 00:33:21,165 So now instead of doing that, we do a cyclic distribution. 807 00:33:21,165 --> 00:33:23,370 We use a static one. 808 00:33:23,370 --> 00:33:27,690 That means if you have a lot more than and a little bit 809 00:33:27,690 --> 00:33:32,220 better distribution so what happens to the processor? 810 00:33:32,220 --> 00:33:33,810 Zero gets this one and this one. 811 00:33:33,810 --> 00:33:35,740 One gets this one and this one. 812 00:33:35,740 --> 00:33:36,860 So on and so forth. 813 00:33:36,860 --> 00:33:39,380 So that would be a little bit between balancing there. 814 00:33:39,380 --> 00:33:42,830 But if you have enough of cyclic, the imbalance would be 815 00:33:42,830 --> 00:33:45,550 much lower. 816 00:33:45,550 --> 00:33:47,610 So should we run faster? 817 00:33:47,610 --> 00:33:51,070 818 00:33:51,070 --> 00:33:56,470 So here's the iterations each guy gets in here. 819 00:33:56,470 --> 00:33:59,360 This looks very balanced because I had a lot more 820 00:33:59,360 --> 00:34:01,810 iterations than this eight one. 821 00:34:01,810 --> 00:34:03,420 This is not that balanced here because this guy gets a lot 822 00:34:03,420 --> 00:34:04,800 more than the first one. 823 00:34:04,800 --> 00:34:06,820 The first one gets six. 824 00:34:06,820 --> 00:34:09,070 And the second and last one gets a lot more. 825 00:34:09,070 --> 00:34:13,775 826 00:34:13,775 --> 00:34:16,560 Uh oh. 827 00:34:16,560 --> 00:34:18,060 What do you think is happening here now? 828 00:34:18,060 --> 00:34:22,040 829 00:34:22,040 --> 00:34:23,290 I ran again slower. 830 00:34:23,290 --> 00:34:25,870 831 00:34:25,870 --> 00:34:28,429 See I guess the people in class last year had things 832 00:34:28,429 --> 00:34:31,199 worse because they had this old processor that did all 833 00:34:31,199 --> 00:34:33,010 these crazy things on them. 834 00:34:33,010 --> 00:34:35,830 and you guys get the fast one that doesn't do that. 835 00:34:35,830 --> 00:34:41,310 So why do you think cyclic distribution is 836 00:34:41,310 --> 00:34:43,639 running a lot slower? 837 00:34:43,639 --> 00:34:44,454 What might be going? 838 00:34:44,454 --> 00:34:45,704 AUDIENCE: [INAUDIBLE] 839 00:34:45,704 --> 00:34:47,420 840 00:34:47,420 --> 00:34:49,219 PROFESSOR: Spoiling [UNINTELLIGIBLE] it's not that 841 00:34:49,219 --> 00:34:51,830 much because if you don't run this and synchronize, what you 842 00:34:51,830 --> 00:34:54,929 do is you run the same amount of tread and say, now, instead 843 00:34:54,929 --> 00:34:59,260 of running continuously, you run jumping all iterations. 844 00:34:59,260 --> 00:35:01,810 You should run zero and nine or whatever jump over 845 00:35:01,810 --> 00:35:03,060 iterations. 846 00:35:03,060 --> 00:35:12,500 847 00:35:12,500 --> 00:35:13,130 Why do you think? 848 00:35:13,130 --> 00:35:14,630 AUDIENCE: [INAUDIBLE] 849 00:35:14,630 --> 00:35:16,240 PROFESSOR: Yeah, there's a cache issue. 850 00:35:16,240 --> 00:35:19,980 All this time and the question is not sure, it's probably a 851 00:35:19,980 --> 00:35:20,470 cache issue. 852 00:35:20,470 --> 00:35:22,550 What type of cache issue do you think is going on? 853 00:35:22,550 --> 00:35:23,800 AUDIENCE: [INAUDIBLE] 854 00:35:23,800 --> 00:35:28,920 855 00:35:28,920 --> 00:35:29,410 PROFESSOR: Yeah. 856 00:35:29,410 --> 00:35:31,010 [UNINTELLIGIBLE]. 857 00:35:31,010 --> 00:35:34,280 But let me show you what happens. 858 00:35:34,280 --> 00:35:35,580 So you get off then. 859 00:35:35,580 --> 00:35:37,780 OK, so if you look at-- 860 00:35:37,780 --> 00:35:40,150 the data is here so let's look at what happens. 861 00:35:40,150 --> 00:35:42,950 So this is running a [UNINTELLIGIBLE] lower. 862 00:35:42,950 --> 00:35:45,460 It's showing a lot more instructions, but instruction 863 00:35:45,460 --> 00:35:47,710 doesn't tell you too much because a lot of them might be 864 00:35:47,710 --> 00:35:49,360 missing synchronization costs in here. 865 00:35:49,360 --> 00:35:52,690 So instruction is not that illuminating here. 866 00:35:52,690 --> 00:35:54,850 The big illumination here is this one again. 867 00:35:54,850 --> 00:35:56,950 Invalidations. 868 00:35:56,950 --> 00:35:59,300 I have a huge amount of invalidations going on. 869 00:35:59,300 --> 00:36:04,330 So here is a case of false sharing. 870 00:36:04,330 --> 00:36:08,080 So what happens is now things next to each other, you want 871 00:36:08,080 --> 00:36:09,550 to multiply different processors. 872 00:36:09,550 --> 00:36:10,890 We're not touching the same data. 873 00:36:10,890 --> 00:36:13,250 Everybody's looking at somebody else's data. 874 00:36:13,250 --> 00:36:16,290 So what happens is assume I want to write this data item. 875 00:36:16,290 --> 00:36:17,550 I like that data item. 876 00:36:17,550 --> 00:36:22,430 But I get the entire cache line because when I ask for 877 00:36:22,430 --> 00:36:25,330 that, I get my synchronization by the cache line. 878 00:36:25,330 --> 00:36:28,830 I get this entire cache line coming in here into this one. 879 00:36:28,830 --> 00:36:30,550 And the next guys [UNINTELLIGIBLE] at me. 880 00:36:30,550 --> 00:36:33,020 This core won't write this data because instead of 881 00:36:33,020 --> 00:36:36,000 blocks, I basically give each strips. 882 00:36:36,000 --> 00:36:37,840 There's a lot of overlap between strips. 883 00:36:37,840 --> 00:36:41,000 So this guy says not to write this one, I had to get the 884 00:36:41,000 --> 00:36:43,580 entire cache line going back here. 885 00:36:43,580 --> 00:36:45,630 And so if you want to write that again, I had to get the 886 00:36:45,630 --> 00:36:47,680 entire cache line going back even though we are writing 887 00:36:47,680 --> 00:36:48,790 different data. 888 00:36:48,790 --> 00:36:52,010 Because we are sharing cache lines in here. 889 00:36:52,010 --> 00:36:53,880 This thinking was in back and forth, back and 890 00:36:53,880 --> 00:36:54,960 forth, back and forth. 891 00:36:54,960 --> 00:36:56,520 I have a lot of cache [UNINTELLIGIBLE]. 892 00:36:56,520 --> 00:36:59,200 Things are really shot. 893 00:36:59,200 --> 00:37:00,160 OK? 894 00:37:00,160 --> 00:37:03,740 And so what happens here is if you look at the cache lines-- 895 00:37:03,740 --> 00:37:05,140 there's my animation. 896 00:37:05,140 --> 00:37:07,280 So cache lines basically mess this all up. 897 00:37:07,280 --> 00:37:08,690 You can see that really carefully. 898 00:37:08,690 --> 00:37:12,500 What happens is between these lines, there would be some 899 00:37:12,500 --> 00:37:13,750 overlap of cache lines. 900 00:37:13,750 --> 00:37:15,710 901 00:37:15,710 --> 00:37:17,890 And this overlap in cache lines keeps bouncing back and 902 00:37:17,890 --> 00:37:19,990 forth, back and forth in here. 903 00:37:19,990 --> 00:37:23,630 And so what happens is basically cache lines are 904 00:37:23,630 --> 00:37:28,850 bigger than the data size, or there's overlap in here, and 905 00:37:28,850 --> 00:37:31,480 the cache line is shared when the data is not shared. 906 00:37:31,480 --> 00:37:37,050 And so how to detect false sharing in too many conflicts. 907 00:37:37,050 --> 00:37:41,340 You assume this is a nice parallelism, but suddenly, you 908 00:37:41,340 --> 00:37:43,840 don't have a speed up, and you have a lot of conflicts here, 909 00:37:43,840 --> 00:37:47,090 even though there isn't something to be sharing. 910 00:37:47,090 --> 00:37:49,770 And how to eliminate false sharing. 911 00:37:49,770 --> 00:37:54,250 Make data used by each contiguous in memory. 912 00:37:54,250 --> 00:37:55,550 That's a good way of doing that. 913 00:37:55,550 --> 00:37:57,360 Or pad at the end. 914 00:37:57,360 --> 00:38:00,330 So these kind of at the corners, there's not going to 915 00:38:00,330 --> 00:38:02,830 be any overlapping. 916 00:38:02,830 --> 00:38:07,770 So in here, one thing you can do is, you can measure each 917 00:38:07,770 --> 00:38:10,370 thing that each of the cores get. 918 00:38:10,370 --> 00:38:11,350 We can make [UNINTELLIGIBLE]. 919 00:38:11,350 --> 00:38:14,560 But before what happens was a core used to get this 920 00:38:14,560 --> 00:38:15,970 line and this line. 921 00:38:15,970 --> 00:38:17,730 There are different places in memory. 922 00:38:17,730 --> 00:38:20,220 But you can make these two contiguous in memory by 923 00:38:20,220 --> 00:38:22,190 basically now, instead of having a two-dimensional 924 00:38:22,190 --> 00:38:23,940 array, you made that a 925 00:38:23,940 --> 00:38:26,210 three-dimensional or disarrays. 926 00:38:26,210 --> 00:38:27,890 AUDIENCE: Can you say that again? 927 00:38:27,890 --> 00:38:31,030 PROFESSOR: So before you what just happened was each of them 928 00:38:31,030 --> 00:38:34,210 were going to get this line and this line, each core. 929 00:38:34,210 --> 00:38:38,070 All these lines that were in different parts of the memory. 930 00:38:38,070 --> 00:38:40,510 In here, each would get only two lines. 931 00:38:40,510 --> 00:38:41,570 But they're in a different place. 932 00:38:41,570 --> 00:38:43,845 So if you have more cyclic, you'll get a lot more lines or 933 00:38:43,845 --> 00:38:44,820 lower memory. 934 00:38:44,820 --> 00:38:47,830 So what we can do is we can arrange the cache. 935 00:38:47,830 --> 00:38:50,560 So if you think about this, you can think the cache, now 936 00:38:50,560 --> 00:38:53,220 the data, is instead of two-dimensions is 937 00:38:53,220 --> 00:38:55,390 three-dimensional data. 938 00:38:55,390 --> 00:38:58,360 One dimension is this cyclic part in here. 939 00:38:58,360 --> 00:38:59,450 So we can do that. 940 00:38:59,450 --> 00:39:04,030 And then, you can change any way that the cyclic part, the 941 00:39:04,030 --> 00:39:06,290 one that I got this line and this line, now become 942 00:39:06,290 --> 00:39:07,660 contiguous. 943 00:39:07,660 --> 00:39:10,390 So you think about data as a two-dimension. 944 00:39:10,390 --> 00:39:11,790 You think about it as a cube. 945 00:39:11,790 --> 00:39:14,690 And you kind of change the cube for the inner dimension 946 00:39:14,690 --> 00:39:16,670 to be the one that's contiguous. 947 00:39:16,670 --> 00:39:18,980 So you can do data [UNINTELLIGIBLE] 948 00:39:18,980 --> 00:39:20,230 transformation and get there. 949 00:39:20,230 --> 00:39:23,290 950 00:39:23,290 --> 00:39:29,130 So now what happens is the role of core zero just gets 951 00:39:29,130 --> 00:39:30,590 contiguous in memory. 952 00:39:30,590 --> 00:39:32,370 And core one gets contiguous in memory. 953 00:39:32,370 --> 00:39:34,600 So if you're trying to make it contiguous, that's great. 954 00:39:34,600 --> 00:39:37,930 So between padding and making things contiguous, you can get 955 00:39:37,930 --> 00:39:38,840 good performance. 956 00:39:38,840 --> 00:39:41,520 And if you do data transformation, voila! 957 00:39:41,520 --> 00:39:44,940 My invalidations just went down drastically. 958 00:39:44,940 --> 00:39:47,900 I again have a nice load balancing here. 959 00:39:47,900 --> 00:39:50,100 Invalidations went down drastically. 960 00:39:50,100 --> 00:39:53,660 That means my [UNINTELLIGIBLE] increased a little bit and I 961 00:39:53,660 --> 00:39:56,600 get really nice speed up. 962 00:39:56,600 --> 00:40:01,320 So here are the kind of crazy things you are to do if you 963 00:40:01,320 --> 00:40:05,200 are doing things like algorithms that 964 00:40:05,200 --> 00:40:07,100 are not cache obvious. 965 00:40:07,100 --> 00:40:08,820 And if you are doing directly parallizing yourself without 966 00:40:08,820 --> 00:40:11,890 letting a nice [UNINTELLIGIBLE] 967 00:40:11,890 --> 00:40:12,660 time to help you. 968 00:40:12,660 --> 00:40:16,580 Something like a [UNINTELLIGIBLE] assistant. 969 00:40:16,580 --> 00:40:19,710 So I'm just going to summarize this 970 00:40:19,710 --> 00:40:20,990 because this is important. 971 00:40:20,990 --> 00:40:22,630 We looked at a bunch of cache issues. 972 00:40:22,630 --> 00:40:25,420 We looked at cold missiles, capacity missiles, and 973 00:40:25,420 --> 00:40:26,970 conflict missiles before. 974 00:40:26,970 --> 00:40:29,390 And today, here are some examples of 975 00:40:29,390 --> 00:40:31,340 true sharing missiles. 976 00:40:31,340 --> 00:40:36,160 What happened was I am actually really using data, 977 00:40:36,160 --> 00:40:41,880 but I set up my parallelism in such a way that between 978 00:40:41,880 --> 00:40:46,000 different executions, my data has to move across. 979 00:40:46,000 --> 00:40:47,030 [UNINTELLIGIBLE] 980 00:40:47,030 --> 00:40:51,230 So I am truly sharing data, but the data has to go to 981 00:40:51,230 --> 00:40:52,250 somebody else's cache. 982 00:40:52,250 --> 00:40:53,110 So I've got a lot of [UNINTELLIGIBLE] 983 00:40:53,110 --> 00:40:54,830 violations here. 984 00:40:54,830 --> 00:40:58,420 More into this one is more like false sharing, where you 985 00:40:58,420 --> 00:41:01,240 assume there's no sharing, nice parallelism, everything, 986 00:41:01,240 --> 00:41:03,120 except the program runs very slow. 987 00:41:03,120 --> 00:41:05,610 And that can be because of false sharing. 988 00:41:05,610 --> 00:41:09,550 So we just kind of touch on these two topics. 989 00:41:09,550 --> 00:41:10,660 OK? 990 00:41:10,660 --> 00:41:14,340 So let me switch gears a little bit about dependences. 991 00:41:14,340 --> 00:41:16,380 We touched on the dependences a little bit. 992 00:41:16,380 --> 00:41:19,070 And these are two fine programs that are not 993 00:41:19,070 --> 00:41:20,410 completely parallel. 994 00:41:20,410 --> 00:41:24,710 So normally, what happens is a true dependence means that I'm 995 00:41:24,710 --> 00:41:26,570 writing and reading [UNINTELLIGIBLE] 996 00:41:26,570 --> 00:41:27,660 other way out. 997 00:41:27,660 --> 00:41:30,400 And if two guys are both fighting, then the order has 998 00:41:30,400 --> 00:41:32,780 to maintiain us out would be dependence. 999 00:41:32,780 --> 00:41:38,650 And did our dependence even loop, because 1000 00:41:38,650 --> 00:41:40,880 these are single items. 1001 00:41:40,880 --> 00:41:43,760 So if you have an error here, this is becoming a lot more 1002 00:41:43,760 --> 00:41:44,600 complicated. 1003 00:41:44,600 --> 00:41:46,200 Because there's no simple thing in here. 1004 00:41:46,200 --> 00:41:48,450 Because it's not just using the same iteration. 1005 00:41:48,450 --> 00:41:51,910 You might be using data from different iterations. 1006 00:41:51,910 --> 00:41:55,920 So what happens is there's a dynamic instance of 1007 00:41:55,920 --> 00:41:56,840 iterations. 1008 00:41:56,840 --> 00:41:59,230 So one iteration writes the data, and somebody else might 1009 00:41:59,230 --> 00:42:01,310 be reading the data. 1010 00:42:01,310 --> 00:42:03,490 And that is basically the order we have to 1011 00:42:03,490 --> 00:42:03,990 [UNINTELLIGIBLE]. 1012 00:42:03,990 --> 00:42:05,060 Let me give you an example. 1013 00:42:05,060 --> 00:42:07,580 This kind of demonstrates what's going on. 1014 00:42:07,580 --> 00:42:08,170 OK? 1015 00:42:08,170 --> 00:42:10,500 And when you edit, you say look, this is where you 1016 00:42:10,500 --> 00:42:10,910 [UNINTELLIGIBLE] 1017 00:42:10,910 --> 00:42:12,590 complicated. 1018 00:42:12,590 --> 00:42:14,380 So in order to give you and example, let me 1019 00:42:14,380 --> 00:42:15,876 look at this program. 1020 00:42:15,876 --> 00:42:17,150 I have a simple program here. 1021 00:42:17,150 --> 00:42:19,210 Ai equals Ai plus one. 1022 00:42:19,210 --> 00:42:19,990 My iterations-- 1023 00:42:19,990 --> 00:42:21,570 I'm running five iterations in here. 1024 00:42:21,570 --> 00:42:23,420 So this is my iteration space. 1025 00:42:23,420 --> 00:42:25,590 I have a large array, so this is my data space. 1026 00:42:25,590 --> 00:42:28,090 1027 00:42:28,090 --> 00:42:29,990 And now, I keep running this program. 1028 00:42:29,990 --> 00:42:32,800 So what happens is this is time going down in here. 1029 00:42:32,800 --> 00:42:35,930 So the first situation basically, I first read and 1030 00:42:35,930 --> 00:42:36,560 then write. 1031 00:42:36,560 --> 00:42:39,350 Same in the second iteration, I read and write. 1032 00:42:39,350 --> 00:42:40,700 Third iteration read and write. 1033 00:42:40,700 --> 00:42:42,770 Fourth iteration, read and write. 1034 00:42:42,770 --> 00:42:44,610 Do you see how this is going on these four situations? 1035 00:42:44,610 --> 00:42:46,050 Second iteration, third iteration, fourth iteration, 1036 00:42:46,050 --> 00:42:48,115 [UNINTELLIGIBLE]. 1037 00:42:48,115 --> 00:42:48,580 OK. 1038 00:42:48,580 --> 00:42:50,710 So what happens is first iteration read 1039 00:42:50,710 --> 00:42:53,810 this value is zero. 1040 00:42:53,810 --> 00:42:56,170 And write the value as zero in the menu writing. 1041 00:42:56,170 --> 00:43:00,930 Second iteration A1, A1, A2, A2, A3, A3. 1042 00:43:00,930 --> 00:43:07,290 So now, when this is writing, that's a dependence 1043 00:43:07,290 --> 00:43:08,160 between these two. 1044 00:43:08,160 --> 00:43:10,150 You see the true and entire output dependence 1045 00:43:10,150 --> 00:43:11,400 between these two. 1046 00:43:11,400 --> 00:43:15,545 1047 00:43:15,545 --> 00:43:18,270 What type of dependence do we have? 1048 00:43:18,270 --> 00:43:19,520 [UNINTELLIGIBLE] dependence. 1049 00:43:19,520 --> 00:43:24,240 1050 00:43:24,240 --> 00:43:26,340 True dependence is what? 1051 00:43:26,340 --> 00:43:28,780 What to what? 1052 00:43:28,780 --> 00:43:29,940 What's the first thing that occurs? 1053 00:43:29,940 --> 00:43:33,620 What's the next thing that occurs? 1054 00:43:33,620 --> 00:43:34,875 Anybody want to answer? 1055 00:43:34,875 --> 00:43:40,820 1056 00:43:40,820 --> 00:43:43,140 AUDIENCE: [INAUDIBLE] 1057 00:43:43,140 --> 00:43:43,900 PROFESSOR: Write to read. 1058 00:43:43,900 --> 00:43:45,720 So you have the first thing has to be write to read. 1059 00:43:45,720 --> 00:43:46,570 Watch this. 1060 00:43:46,570 --> 00:43:48,800 This is a read to write. 1061 00:43:48,800 --> 00:43:51,230 So what type of dependence is this? 1062 00:43:51,230 --> 00:43:52,240 This is anti-dependent. 1063 00:43:52,240 --> 00:43:54,680 So here is ante-dependence in here. 1064 00:43:54,680 --> 00:43:57,710 But the nice thing about that is this dependent didn't cross 1065 00:43:57,710 --> 00:43:58,940 the iteration boundary. 1066 00:43:58,940 --> 00:44:01,050 So these black lines are my iteration boundaries. 1067 00:44:01,050 --> 00:44:03,190 So these are for situations that [UNINTELLIGIBLE]. 1068 00:44:03,190 --> 00:44:06,780 So there's no iteration crossing in here. 1069 00:44:06,780 --> 00:44:09,360 You can kind of [UNINTELLIGIBLE] it using each 1070 00:44:09,360 --> 00:44:13,590 of these iterations and my dependencies within the very 1071 00:44:13,590 --> 00:44:14,360 same iteration. 1072 00:44:14,360 --> 00:44:18,116 So the same iteration I have dependency [UNINTELLIGIBLE]. 1073 00:44:18,116 --> 00:44:18,590 OK? 1074 00:44:18,590 --> 00:44:19,910 This is a simpler case. 1075 00:44:19,910 --> 00:44:24,060 So let's look at something a little bit more complicated. 1076 00:44:24,060 --> 00:44:28,560 So I have Ai plus 1 equals Ai plus 1. 1077 00:44:28,560 --> 00:44:32,430 So what happens is first I am reading Ai. 1078 00:44:32,430 --> 00:44:39,270 And then, I am writing Ai plus 1 in the same iteration. 1079 00:44:39,270 --> 00:44:41,100 The next iteration, I am reading now Ai 1080 00:44:41,100 --> 00:44:42,050 [UNINTELLIGIBLE] 1081 00:44:42,050 --> 00:44:44,250 this is A0 and 1. 1082 00:44:44,250 --> 00:44:45,890 This is A is 1. 1083 00:44:45,890 --> 00:44:46,410 [UNINTELLIGIBLE] 1084 00:44:46,410 --> 00:44:47,880 I am writing Ai plus 1. 1085 00:44:47,880 --> 00:44:49,130 I am writing 2. 1086 00:44:49,130 --> 00:44:51,610 1087 00:44:51,610 --> 00:44:54,070 So I have a dependence like this now. 1088 00:44:54,070 --> 00:44:55,320 What type of dependence is this? 1089 00:44:55,320 --> 00:44:58,190 1090 00:44:58,190 --> 00:45:00,560 This is a true dependence because I am writing. 1091 00:45:00,560 --> 00:45:04,610 And this is actually reading what what it is writing. 1092 00:45:04,610 --> 00:45:08,720 So does this look parallel? 1093 00:45:08,720 --> 00:45:09,590 No. 1094 00:45:09,590 --> 00:45:13,790 Because what happens is if you look at each iteration depends 1095 00:45:13,790 --> 00:45:15,680 on the previous iteration. 1096 00:45:15,680 --> 00:45:18,040 So you have to actually have this dependence going back and 1097 00:45:18,040 --> 00:45:20,890 forth, back and forth in here. 1098 00:45:20,890 --> 00:45:23,550 So let's look at a couple more other things. 1099 00:45:23,550 --> 00:45:26,830 So here is Ai equals Ai plus 2. 1100 00:45:26,830 --> 00:45:30,660 So I am basically reading Ai plus 2. 1101 00:45:30,660 --> 00:45:31,670 So I am reading this one. 1102 00:45:31,670 --> 00:45:33,220 I am writing this one. 1103 00:45:33,220 --> 00:45:33,810 Reading this one. 1104 00:45:33,810 --> 00:45:36,060 Writing this one. 1105 00:45:36,060 --> 00:45:38,040 Here is my dependence that's in here. 1106 00:45:38,040 --> 00:45:39,620 You see the two are anti in here. 1107 00:45:39,620 --> 00:45:42,210 1108 00:45:42,210 --> 00:45:44,150 This is anti-dependence because I am going from a 1109 00:45:44,150 --> 00:45:47,160 reading to a write in here. 1110 00:45:47,160 --> 00:45:48,510 Can this loop be parallel? 1111 00:45:48,510 --> 00:45:55,240 1112 00:45:55,240 --> 00:45:57,744 Can this loop run parallel? 1113 00:45:57,744 --> 00:46:00,440 AUDIENCE: [INAUDIBLE] 1114 00:46:00,440 --> 00:46:04,090 PROFESSOR: So can every iteration run parallel? 1115 00:46:04,090 --> 00:46:05,030 There could be basically. 1116 00:46:05,030 --> 00:46:07,590 No because what happens is if you look at that, there's a 1117 00:46:07,590 --> 00:46:09,690 dependence that goes like this. 1118 00:46:09,690 --> 00:46:11,060 And of course, there are two chains. 1119 00:46:11,060 --> 00:46:14,170 So if you are interested, you can run at least two-way 1120 00:46:14,170 --> 00:46:15,210 parallelism. 1121 00:46:15,210 --> 00:46:18,200 You can run one chain parallel to another chain when you do 1122 00:46:18,200 --> 00:46:21,130 get that much parallelism. 1123 00:46:21,130 --> 00:46:22,380 How about this one? 1124 00:46:22,380 --> 00:46:24,740 1125 00:46:24,740 --> 00:46:26,970 2i and 2i plus 1. 1126 00:46:26,970 --> 00:46:28,220 [UNINTELLIGIBLE] 1127 00:46:28,220 --> 00:46:29,890 1128 00:46:29,890 --> 00:46:31,140 Is there independence in here? 1129 00:46:31,140 --> 00:46:34,740 1130 00:46:34,740 --> 00:46:36,120 Nope because one is-- 1131 00:46:36,120 --> 00:46:38,300 you are reading all the elements and even writing 1132 00:46:38,300 --> 00:46:39,210 elements [UNINTELLIGIBLE] 1133 00:46:39,210 --> 00:46:39,730 dependence. 1134 00:46:39,730 --> 00:46:42,100 So you can have a missing parallel. 1135 00:46:42,100 --> 00:46:43,230 OK? 1136 00:46:43,230 --> 00:46:46,210 So this is the kind of interesting thing 1137 00:46:46,210 --> 00:46:46,890 that is going on. 1138 00:46:46,890 --> 00:46:51,150 So next, I want to look at something a little bit more 1139 00:46:51,150 --> 00:46:52,130 complicated. 1140 00:46:52,130 --> 00:46:54,400 So let's look at this. 1141 00:46:54,400 --> 00:46:59,850 So here's a classic algorithm called successive over 1142 00:46:59,850 --> 00:47:01,320 relaxation. 1143 00:47:01,320 --> 00:47:05,050 So it kind of simulates a lot of times things like heat flow 1144 00:47:05,050 --> 00:47:06,460 through a plane. 1145 00:47:06,460 --> 00:47:08,830 So the idea there is-- 1146 00:47:08,830 --> 00:47:10,500 let me illustrate what he does. 1147 00:47:10,500 --> 00:47:16,750 So assume you have a big metal sheet. 1148 00:47:16,750 --> 00:47:20,790 And you put some kind of heat source in one place. 1149 00:47:20,790 --> 00:47:23,640 And after sometime, it all reaches a steady state. 1150 00:47:23,640 --> 00:47:24,920 The other side might be cold. 1151 00:47:24,920 --> 00:47:28,190 And you want to know part of the sheet's temperature. 1152 00:47:28,190 --> 00:47:31,070 1153 00:47:31,070 --> 00:47:34,700 Because temperature can leak out. 1154 00:47:34,700 --> 00:47:40,515 And there are more things like you have a heat source and 1155 00:47:40,515 --> 00:47:44,480 others that [UNINTELLIGIBLE] to work a glass of water or 1156 00:47:44,480 --> 00:47:45,300 some kind of a sink. 1157 00:47:45,300 --> 00:47:47,650 So what is the heat distribution? 1158 00:47:47,650 --> 00:47:51,120 So one way to compare that this is basically the same 1159 00:47:51,120 --> 00:47:51,550 [UNINTELLIGIBLE]. 1160 00:47:51,550 --> 00:47:55,230 The heat value here is basically the average around 1161 00:47:55,230 --> 00:47:59,690 all these other values right now. 1162 00:47:59,690 --> 00:48:01,700 Because if something is too hot, the heat is going to 1163 00:48:01,700 --> 00:48:03,040 propagate something that is too cold. 1164 00:48:03,040 --> 00:48:04,800 The heat is going to propagate because it kind of has to be 1165 00:48:04,800 --> 00:48:06,820 average around that. 1166 00:48:06,820 --> 00:48:07,840 Then, you take the average in here. 1167 00:48:07,840 --> 00:48:09,260 So what it's doing is calculating 1168 00:48:09,260 --> 00:48:11,200 the average in here. 1169 00:48:11,200 --> 00:48:12,810 And then, you have to do it many, many times. 1170 00:48:12,810 --> 00:48:14,480 So if you have a heat source, at that point, it 1171 00:48:14,480 --> 00:48:15,300 would be very hard. 1172 00:48:15,300 --> 00:48:17,790 And then, it will start propagating slowly and kind of 1173 00:48:17,790 --> 00:48:18,810 propagate down. 1174 00:48:18,810 --> 00:48:21,890 And the cold side in this way or after running many times, 1175 00:48:21,890 --> 00:48:23,180 it basically stabilizes. 1176 00:48:23,180 --> 00:48:25,350 And at that point, you have the kind of heat distribution 1177 00:48:25,350 --> 00:48:26,180 that we [UNINTELLIGIBLE] have. 1178 00:48:26,180 --> 00:48:28,220 This is the kind of calculation you do. 1179 00:48:28,220 --> 00:48:30,070 So this is the calculation. 1180 00:48:30,070 --> 00:48:31,410 So what you're doing is calculating this. 1181 00:48:31,410 --> 00:48:35,220 You are creating this, this, this, and this 1182 00:48:35,220 --> 00:48:36,370 and updating that. 1183 00:48:36,370 --> 00:48:38,480 And then, you do it for t time stamps. 1184 00:48:38,480 --> 00:48:41,380 So you just go around doing each of these things first and 1185 00:48:41,380 --> 00:48:44,284 doing it for t time stamps. 1186 00:48:44,284 --> 00:48:44,740 OK? 1187 00:48:44,740 --> 00:48:46,608 So we would like to run this parallel. 1188 00:48:46,608 --> 00:48:49,620 1189 00:48:49,620 --> 00:48:53,650 So here's my basically data space. 1190 00:48:53,650 --> 00:48:55,270 There's my data items. 1191 00:48:55,270 --> 00:48:56,980 So here's my array, two-dimensional array. 1192 00:48:56,980 --> 00:48:58,540 So this is how I'm trying to update. 1193 00:48:58,540 --> 00:49:00,170 I'm reading all this file. 1194 00:49:00,170 --> 00:49:01,760 So here's my iteration space. 1195 00:49:01,760 --> 00:49:03,050 So what I have looked at this. 1196 00:49:03,050 --> 00:49:04,590 I don't want to-- 1197 00:49:04,590 --> 00:49:06,400 it's hard to give you a 3D diagram. 1198 00:49:06,400 --> 00:49:08,190 I don't have a 3D projector. 1199 00:49:08,190 --> 00:49:11,160 So what I'm showing here is three-dimension here. 1200 00:49:11,160 --> 00:49:13,850 So this is the previous iteration, first iteration. 1201 00:49:13,850 --> 00:49:15,990 So if I still go tij. 1202 00:49:15,990 --> 00:49:19,800 So you go through t, and then you go through i in here, and 1203 00:49:19,800 --> 00:49:21,560 then, when you're done, you go to the [UNINTELLIGIBLE] 1204 00:49:21,560 --> 00:49:23,440 iteration and you go this way. 1205 00:49:23,440 --> 00:49:24,820 So here's how you would iterate. 1206 00:49:24,820 --> 00:49:27,300 So you run this one, this one, this one, this one, this one, 1207 00:49:27,300 --> 00:49:28,530 this one, this one, this one, this one. 1208 00:49:28,530 --> 00:49:31,500 And then increase t by 1, and go like this. 1209 00:49:31,500 --> 00:49:35,370 And right now, we are here. 1210 00:49:35,370 --> 00:49:37,140 We are trying to update this one. 1211 00:49:37,140 --> 00:49:39,080 That's what we are trying to do. 1212 00:49:39,080 --> 00:49:42,570 And that means we are already finished up to this point. 1213 00:49:42,570 --> 00:49:45,760 All these points are finished up. 1214 00:49:45,760 --> 00:49:50,160 Now, what we have to do is figure out when I'm reading, 1215 00:49:50,160 --> 00:49:53,480 who actually wrote this value. 1216 00:49:53,480 --> 00:49:53,920 OK? 1217 00:49:53,920 --> 00:49:56,830 First of all, let's figure out which iterations might be able 1218 00:49:56,830 --> 00:49:58,430 to write this value. 1219 00:49:58,430 --> 00:50:04,460 So if you look at this value, this 1220 00:50:04,460 --> 00:50:06,110 relationship in between here. 1221 00:50:06,110 --> 00:50:09,060 This one, basically, is ij. 1222 00:50:09,060 --> 00:50:11,030 And this is ij, ij, ij. 1223 00:50:11,030 --> 00:50:13,770 These three iterations can write this one. 1224 00:50:13,770 --> 00:50:17,650 So and these iterations can write this one. 1225 00:50:17,650 --> 00:50:19,770 Let me go to this one. 1226 00:50:19,770 --> 00:50:22,480 This is a pretty darn complicated [UNINTELLIGIBLE]. 1227 00:50:22,480 --> 00:50:26,650 So what that means is in this one, this one 1228 00:50:26,650 --> 00:50:28,870 already wrote something. 1229 00:50:28,870 --> 00:50:30,990 This is what I'm reading in here. 1230 00:50:30,990 --> 00:50:32,500 This one already wrote something. 1231 00:50:32,500 --> 00:50:33,355 This is what I'm reading here. 1232 00:50:33,355 --> 00:50:34,390 This iteration wrote something. 1233 00:50:34,390 --> 00:50:35,930 I read it here. 1234 00:50:35,930 --> 00:50:36,200 OK. 1235 00:50:36,200 --> 00:50:38,850 Everybody following so far? 1236 00:50:38,850 --> 00:50:40,170 How about this, guys? 1237 00:50:40,170 --> 00:50:42,665 Who wrote the value I am reading in these iterations? 1238 00:50:42,665 --> 00:50:47,530 1239 00:50:47,530 --> 00:50:50,300 In this one, I haven't reached there yet. 1240 00:50:50,300 --> 00:50:53,130 So who has written that? 1241 00:50:53,130 --> 00:50:56,812 So I assume this is t equals 1. 1242 00:50:56,812 --> 00:50:57,630 [UNINTELLIGIBLE] 1243 00:50:57,630 --> 00:50:59,260 somebody has to write those things. 1244 00:50:59,260 --> 00:51:03,150 So what that means is this also wrote all of those values 1245 00:51:03,150 --> 00:51:04,520 because I have done those iterations. 1246 00:51:04,520 --> 00:51:07,190 But the interesting thing is some of these values got 1247 00:51:07,190 --> 00:51:07,630 overwritten. 1248 00:51:07,630 --> 00:51:09,650 This value got overwritten , this value got overwritten. 1249 00:51:09,650 --> 00:51:13,310 So these two values disappear. 1250 00:51:13,310 --> 00:51:15,720 This value got overwritten by this guy. 1251 00:51:15,720 --> 00:51:20,080 This value got overwritten by this guy. 1252 00:51:20,080 --> 00:51:20,280 OK. 1253 00:51:20,280 --> 00:51:22,400 But we haven't overwritten this value, this value, and 1254 00:51:22,400 --> 00:51:23,310 this value yet. 1255 00:51:23,310 --> 00:51:25,850 This one, basically, I've just updated. 1256 00:51:25,850 --> 00:51:26,620 But I [UNINTELLIGIBLE] 1257 00:51:26,620 --> 00:51:28,894 this one. 1258 00:51:28,894 --> 00:51:30,676 Do you see this? 1259 00:51:30,676 --> 00:51:33,954 Is everybody following me? 1260 00:51:33,954 --> 00:51:34,906 AUDIENCE: Once again, sir. 1261 00:51:34,906 --> 00:51:36,810 I got lost. 1262 00:51:36,810 --> 00:51:40,630 So what are [INAUDIBLE] 1263 00:51:40,630 --> 00:51:42,640 PROFESSOR: So what happens is I am trying to update in this 1264 00:51:42,640 --> 00:51:46,780 iteration because this array get rid of multiple times. 1265 00:51:46,780 --> 00:51:50,200 But in each iteration, you are only doing one update. 1266 00:51:50,200 --> 00:51:52,220 So I am trying to read and write in here. 1267 00:51:52,220 --> 00:51:54,380 So I need to read all of these five 1268 00:51:54,380 --> 00:51:56,010 elements in this iteration. 1269 00:51:56,010 --> 00:51:58,450 So I want to figure out who wrote that. 1270 00:51:58,450 --> 00:51:59,550 OK? 1271 00:51:59,550 --> 00:52:03,180 This one can be written by this guy and this iteration. 1272 00:52:03,180 --> 00:52:05,440 Could this iteration write its value in here? 1273 00:52:05,440 --> 00:52:08,110 1274 00:52:08,110 --> 00:52:09,000 OK? 1275 00:52:09,000 --> 00:52:09,420 [UNINTELLIGIBLE] 1276 00:52:09,420 --> 00:52:11,990 This iteration write because we see it's writing ij. 1277 00:52:11,990 --> 00:52:16,890 I mean my diagram is not that great because I have three in 1278 00:52:16,890 --> 00:52:18,660 here and five in here. 1279 00:52:18,660 --> 00:52:19,890 So just bear with me on that. 1280 00:52:19,890 --> 00:52:21,395 So assume I am writing ij in here. 1281 00:52:21,395 --> 00:52:24,760 1282 00:52:24,760 --> 00:52:28,150 So my iterations go from 1 to n, but my data goes from 0 to 1283 00:52:28,150 --> 00:52:29,260 n plus 1, basically. 1284 00:52:29,260 --> 00:52:30,990 1 to n minus 1 iterations. 1285 00:52:30,990 --> 00:52:32,270 0 to n plus 1 data. 1286 00:52:32,270 --> 00:52:35,160 So data is bigger than iteration space because of 1287 00:52:35,160 --> 00:52:36,290 [UNINTELLIGIBLE]. 1288 00:52:36,290 --> 00:52:40,740 So what happens is when I'm in this iteration, we'll say this 1289 00:52:40,740 --> 00:52:44,770 is 1 2 iteration. 1290 00:52:44,770 --> 00:52:47,590 I will write this value. 1291 00:52:47,590 --> 00:52:52,370 This iteration will also write this value. 1292 00:52:52,370 --> 00:52:53,075 OK? 1293 00:52:53,075 --> 00:52:54,730 You see that? 1294 00:52:54,730 --> 00:52:56,400 All of these iterations are the same. 1295 00:52:56,400 --> 00:52:58,170 This iteration we will also write this value. 1296 00:52:58,170 --> 00:53:00,880 But right now, who is the last guy who wrote it? 1297 00:53:00,880 --> 00:53:02,400 The last guy that wrote it is this guy. 1298 00:53:02,400 --> 00:53:05,125 1299 00:53:05,125 --> 00:53:07,910 Because this iteration wrote it, and after that, it got 1300 00:53:07,910 --> 00:53:09,060 ordered in this one. 1301 00:53:09,060 --> 00:53:11,330 But this one hadn't occurred yet, so it hadn't been ordered 1302 00:53:11,330 --> 00:53:11,900 by this guy. 1303 00:53:11,900 --> 00:53:15,670 So the last guy who wrote it was this one. 1304 00:53:15,670 --> 00:53:16,820 So that's why I had to eliminate this. 1305 00:53:16,820 --> 00:53:19,630 But this data value-- 1306 00:53:19,630 --> 00:53:21,400 I haven't executed this iteration yet. 1307 00:53:21,400 --> 00:53:24,260 So nobody had written this one in this time stamp. 1308 00:53:24,260 --> 00:53:26,310 So it has to be from the previous time stamp. 1309 00:53:26,310 --> 00:53:32,300 So I read two values from the current time stamp, three 1310 00:53:32,300 --> 00:53:33,870 values from the previous time stamp. 1311 00:53:33,870 --> 00:53:35,120 These three values have to come from the 1312 00:53:35,120 --> 00:53:35,980 previous time stamp. 1313 00:53:35,980 --> 00:53:38,670 These two values that come from the current time stamp. 1314 00:53:38,670 --> 00:53:39,830 You see that? 1315 00:53:39,830 --> 00:53:40,105 OK. 1316 00:53:40,105 --> 00:53:41,350 Good. 1317 00:53:41,350 --> 00:53:45,250 So what that means is because dependence means-- 1318 00:53:45,250 --> 00:53:47,500 OK. 1319 00:53:47,500 --> 00:53:50,455 This line, this dark, red line. 1320 00:53:50,455 --> 00:53:51,160 See. 1321 00:53:51,160 --> 00:53:54,975 I am reading a value in a current iteration that was 1322 00:53:54,975 --> 00:53:56,070 written by this iteration. 1323 00:53:56,070 --> 00:53:58,210 So that means I have no dependence between these two 1324 00:53:58,210 --> 00:54:00,670 iterations. 1325 00:54:00,670 --> 00:54:00,960 OK. 1326 00:54:00,960 --> 00:54:03,930 This line, this dark, red line. 1327 00:54:03,930 --> 00:54:06,020 I am reading a value written by this iteration. 1328 00:54:06,020 --> 00:54:07,870 So I have a dependency in here. 1329 00:54:07,870 --> 00:54:10,070 This line means I have a dependence between this 1330 00:54:10,070 --> 00:54:11,750 iteration to the current one. 1331 00:54:11,750 --> 00:54:13,310 This line means I have dependence between this 1332 00:54:13,310 --> 00:54:14,540 iteration and the current one. 1333 00:54:14,540 --> 00:54:15,980 This line means I have dependence between this 1334 00:54:15,980 --> 00:54:18,980 iteration and the current one. 1335 00:54:18,980 --> 00:54:20,230 You see that? 1336 00:54:20,230 --> 00:54:22,480 1337 00:54:22,480 --> 00:54:23,320 OK. 1338 00:54:23,320 --> 00:54:28,560 So now, I want to see how we can parallelize this group. 1339 00:54:28,560 --> 00:54:30,380 So what can I do? 1340 00:54:30,380 --> 00:54:32,200 So I look at all this dependence. 1341 00:54:32,200 --> 00:54:34,720 At this point, I don't have to think about all this where who 1342 00:54:34,720 --> 00:54:34,910 wrote what. 1343 00:54:34,910 --> 00:54:37,840 I can say this is dependence. 1344 00:54:37,840 --> 00:54:43,310 In order to do this equation, all these iterations have to 1345 00:54:43,310 --> 00:54:47,440 be done because I am losing the values produced by them. 1346 00:54:47,440 --> 00:54:49,740 So these have to be finished before I can patch that. 1347 00:54:49,740 --> 00:54:51,760 So the parallelism means I tried to 1348 00:54:51,760 --> 00:54:54,430 do things in parallel. 1349 00:54:54,430 --> 00:54:56,900 So can we parallelize this loop? 1350 00:54:56,900 --> 00:55:00,120 1351 00:55:00,120 --> 00:55:01,630 Can we run each time stamp separately? 1352 00:55:01,630 --> 00:55:05,110 1353 00:55:05,110 --> 00:55:08,940 No because I am using these three values from the previous 1354 00:55:08,940 --> 00:55:10,100 time stamp. 1355 00:55:10,100 --> 00:55:14,030 So I can't run this time stamp, B equals 1, until B 1356 00:55:14,030 --> 00:55:15,670 equals 0 is done. 1357 00:55:15,670 --> 00:55:16,460 Or B plus 2 [UNINTELLIGIBLE] 1358 00:55:16,460 --> 00:55:18,700 B plus 1 is done. 1359 00:55:18,700 --> 00:55:19,030 OK? 1360 00:55:19,030 --> 00:55:22,690 So I can't parallelize this loop. 1361 00:55:22,690 --> 00:55:22,930 OK. 1362 00:55:22,930 --> 00:55:26,320 Can I parallelize this loop? 1363 00:55:26,320 --> 00:55:27,850 Why? 1364 00:55:27,850 --> 00:55:31,000 Will dependence stop me from parallelizing this one? 1365 00:55:31,000 --> 00:55:36,070 1366 00:55:36,070 --> 00:55:37,710 So I'm looking at i. 1367 00:55:37,710 --> 00:55:40,180 This is my i dimension. 1368 00:55:40,180 --> 00:55:41,720 How many lines, at least, tell me. 1369 00:55:41,720 --> 00:55:43,370 How many dependencies are going to stop 1370 00:55:43,370 --> 00:55:46,850 me from doing that? 1371 00:55:46,850 --> 00:55:47,160 OK. 1372 00:55:47,160 --> 00:55:47,510 Good. 1373 00:55:47,510 --> 00:55:48,180 I have [UNINTELLIGIBLE]. 1374 00:55:48,180 --> 00:55:49,290 Somebody says three. 1375 00:55:49,290 --> 00:55:50,750 Somebody says one. 1376 00:55:50,750 --> 00:55:51,030 OK. 1377 00:55:51,030 --> 00:55:51,930 Let's get a vote. 1378 00:55:51,930 --> 00:55:53,545 How many people think it's three? 1379 00:55:53,545 --> 00:55:56,390 1380 00:55:56,390 --> 00:55:56,630 OK. 1381 00:55:56,630 --> 00:55:58,760 There's one vote for three. 1382 00:55:58,760 --> 00:56:00,060 How many people think it's three? 1383 00:56:00,060 --> 00:56:01,310 How many people think it's one? 1384 00:56:01,310 --> 00:56:03,770 1385 00:56:03,770 --> 00:56:04,530 Wait a minute. 1386 00:56:04,530 --> 00:56:08,100 One vote for three and two votes for one? 1387 00:56:08,100 --> 00:56:08,350 OK. 1388 00:56:08,350 --> 00:56:10,100 Where's the rest? 1389 00:56:10,100 --> 00:56:10,930 For two? 1390 00:56:10,930 --> 00:56:12,110 For 0? 1391 00:56:12,110 --> 00:56:13,480 Can't be 0 if the 0 is parallel. 1392 00:56:13,480 --> 00:56:14,300 OK. 1393 00:56:14,300 --> 00:56:16,150 So we'll start parallelizing. 1394 00:56:16,150 --> 00:56:16,750 OK. 1395 00:56:16,750 --> 00:56:18,960 So what happens in here? 1396 00:56:18,960 --> 00:56:21,210 Right now, this is actually one. 1397 00:56:21,210 --> 00:56:22,660 This one. 1398 00:56:22,660 --> 00:56:26,900 Because these things don't participate because this has 1399 00:56:26,900 --> 00:56:28,350 already happened. 1400 00:56:28,350 --> 00:56:33,030 When you go to ij iterations, these are already done. 1401 00:56:33,030 --> 00:56:34,170 So you're going from t. 1402 00:56:34,170 --> 00:56:36,160 So you're looking at the current iterations because 1403 00:56:36,160 --> 00:56:38,680 you're ending in two loops. 1404 00:56:38,680 --> 00:56:39,740 So the t is done. 1405 00:56:39,740 --> 00:56:41,720 So these all are already done when you go try 1406 00:56:41,720 --> 00:56:42,490 to parallelize sides. 1407 00:56:42,490 --> 00:56:44,740 So I don't have to worry about these three. 1408 00:56:44,740 --> 00:56:48,150 In here because actually I'm losing t of something here, I 1409 00:56:48,150 --> 00:56:51,480 am in trouble. 1410 00:56:51,480 --> 00:56:56,410 So when you go look at this one, I have this one. 1411 00:56:56,410 --> 00:56:59,020 So every dimension has a dependence in here. 1412 00:56:59,020 --> 00:57:00,750 So I can't run it in parallel. 1413 00:57:00,750 --> 00:57:02,740 So does this mean that there's no parallelism? 1414 00:57:02,740 --> 00:57:08,160 1415 00:57:08,160 --> 00:57:09,410 Who think there's no parallelism? 1416 00:57:09,410 --> 00:57:12,410 1417 00:57:12,410 --> 00:57:13,340 Who thinks there is? 1418 00:57:13,340 --> 00:57:14,370 Oh, somebody thinks there's no parallelism. 1419 00:57:14,370 --> 00:57:16,680 Who thinks there's parallelism? 1420 00:57:16,680 --> 00:57:17,170 OK. 1421 00:57:17,170 --> 00:57:18,080 More people think there's parallelism. 1422 00:57:18,080 --> 00:57:20,090 Let's see what we can do. 1423 00:57:20,090 --> 00:57:21,889 Question? 1424 00:57:21,889 --> 00:57:24,354 AUDIENCE: Do you really think [INAUDIBLE] 1425 00:57:24,354 --> 00:57:34,214 1426 00:57:34,214 --> 00:57:36,679 I'm trying to figure out how to word this. 1427 00:57:36,679 --> 00:57:38,158 Do you really want to have 1428 00:57:38,158 --> 00:57:39,144 dependence on the same concept? 1429 00:57:39,144 --> 00:57:40,394 [INAUDIBLE]? 1430 00:57:40,394 --> 00:57:43,620 1431 00:57:43,620 --> 00:57:43,880 PROFESSOR: Yeah. 1432 00:57:43,880 --> 00:57:45,730 I mean you can do-- 1433 00:57:45,730 --> 00:57:48,810 this is the way this SOR is sitting so there's a 1434 00:57:48,810 --> 00:57:50,420 dependence between time stamp. 1435 00:57:50,420 --> 00:57:51,750 There's another SOR. 1436 00:57:51,750 --> 00:57:53,380 What they do is kind of a red, black. 1437 00:57:53,380 --> 00:57:56,400 So when you calculate the next time stamp, you calculate it 1438 00:57:56,400 --> 00:57:57,760 right and complete the new array. 1439 00:57:57,760 --> 00:57:58,560 So there's no dependence. 1440 00:57:58,560 --> 00:58:00,900 So that's a different algorithm. 1441 00:58:00,900 --> 00:58:03,800 This algorithm, basically, uses sum value from 1442 00:58:03,800 --> 00:58:05,730 [UNINTELLIGIBLE] because the value-- the algorithm you're 1443 00:58:05,730 --> 00:58:06,985 talking-- you already created the other copies. 1444 00:58:06,985 --> 00:58:07,890 You had two copies. 1445 00:58:07,890 --> 00:58:09,170 You're bouncing back and forth. 1446 00:58:09,170 --> 00:58:09,530 Nice. 1447 00:58:09,530 --> 00:58:11,100 No real problem in here. 1448 00:58:11,100 --> 00:58:12,760 But then you had to have twice the amount of storage. 1449 00:58:12,760 --> 00:58:14,320 Here, you are updating in. 1450 00:58:14,320 --> 00:58:17,740 And since this is kind of running enough iterations 1451 00:58:17,740 --> 00:58:23,190 until it converges, it doesn't seem to matter that the 1452 00:58:23,190 --> 00:58:24,440 [UNINTELLIGIBLE PHRASE]. 1453 00:58:24,440 --> 00:58:26,520 1454 00:58:26,520 --> 00:58:27,280 OK. 1455 00:58:27,280 --> 00:58:32,380 So we cannot find a loop, what we call doall loop. 1456 00:58:32,380 --> 00:58:35,380 The doall loop means there's no loop carried dependences. 1457 00:58:35,380 --> 00:58:36,930 It's fully parallel. 1458 00:58:36,930 --> 00:58:38,900 This is the best case. 1459 00:58:38,900 --> 00:58:41,470 So what happens is when you get there, 1460 00:58:41,470 --> 00:58:42,620 everybody can run parallel. 1461 00:58:42,620 --> 00:58:44,770 And when you're done, you can stop and then do that. 1462 00:58:44,770 --> 00:58:46,540 So this is the doall loop. 1463 00:58:46,540 --> 00:58:47,560 Of course, there's no doall loop. 1464 00:58:47,560 --> 00:58:48,830 We can look at every dimension. 1465 00:58:48,830 --> 00:58:50,290 We had some kind of dependence. 1466 00:58:50,290 --> 00:58:53,430 So there's another choice, what we call doacross loop. 1467 00:58:53,430 --> 00:58:57,200 What that means is we have some loop carried dependence. 1468 00:58:57,200 --> 00:58:58,760 There's something I have to use for 1469 00:58:58,760 --> 00:59:00,320 the previous iteration. 1470 00:59:00,320 --> 00:59:01,910 But it's only one thing. 1471 00:59:01,910 --> 00:59:05,150 I have a lot of other things I can run around that only I 1472 00:59:05,150 --> 00:59:06,190 just have to wait one thing. 1473 00:59:06,190 --> 00:59:06,820 One is done. 1474 00:59:06,820 --> 00:59:08,190 I can just keep running. 1475 00:59:08,190 --> 00:59:11,020 And if I calculate and send this one early, then I can do 1476 00:59:11,020 --> 00:59:13,250 my other calculations later. 1477 00:59:13,250 --> 00:59:14,160 This is not that great. 1478 00:59:14,160 --> 00:59:15,820 If you look at the difference here. 1479 00:59:15,820 --> 00:59:19,350 This definitely has very little overhead in here. 1480 00:59:19,350 --> 00:59:21,330 This can run slow. 1481 00:59:21,330 --> 00:59:23,050 And of course, this thing gets produced very late. 1482 00:59:23,050 --> 00:59:24,790 It's [? almost ?] sequential. 1483 00:59:24,790 --> 00:59:27,890 So I hope you can just-- it the other guy wants something, 1484 00:59:27,890 --> 00:59:29,810 I can immediately send it very early. 1485 00:59:29,810 --> 00:59:31,560 And then I can run there. 1486 00:59:31,560 --> 00:59:35,600 So you can get some kind of doacross patterns in here. 1487 00:59:35,600 --> 00:59:37,690 So if you want to do this one-- 1488 00:59:37,690 --> 00:59:39,710 this is a little bit crazy in here. 1489 00:59:39,710 --> 00:59:41,580 But they'll do it in here. 1490 00:59:41,580 --> 00:59:44,260 And so what first we are to do is you are to say, OK. 1491 00:59:44,260 --> 00:59:44,640 Look. 1492 00:59:44,640 --> 00:59:48,570 I'm running this loop, the i loop in parallel. 1493 00:59:48,570 --> 00:59:52,410 But I have to exchange some data. 1494 00:59:52,410 --> 00:59:55,730 Before I want to run this one, I have to basically get the 1495 00:59:55,730 --> 00:59:58,060 previous i value produced. 1496 00:59:58,060 --> 01:00:00,620 And when it's done, I can say the next guy can use it. 1497 01:00:00,620 --> 01:00:02,490 So this is a very complicated one. 1498 01:00:02,490 --> 01:00:05,430 I don't want you to understand it too well. 1499 01:00:05,430 --> 01:00:09,210 So the reason I put it is to show that OK, if you want to 1500 01:00:09,210 --> 01:00:12,080 spend a week trying to really call this up and understand 1501 01:00:12,080 --> 01:00:14,230 and make sure that it works OK. 1502 01:00:14,230 --> 01:00:16,930 So you can do things like that. 1503 01:00:16,930 --> 01:00:17,910 OK? 1504 01:00:17,910 --> 01:00:18,570 Aha. 1505 01:00:18,570 --> 01:00:21,410 So this is the true voodooness. 1506 01:00:21,410 --> 01:00:23,170 OK. 1507 01:00:23,170 --> 01:00:28,150 AUDIENCE: So in Cilk, if you do this with divide and 1508 01:00:28,150 --> 01:00:34,400 conquer, you can make it be what I called in the Tableau 1509 01:00:34,400 --> 01:00:35,760 construction. 1510 01:00:35,760 --> 01:00:38,820 Each layer here is basically constructing a Tableau. 1511 01:00:38,820 --> 01:00:41,078 And so if you do it with divide and conquer, you can do 1512 01:00:41,078 --> 01:00:44,670 it with a very simple recursive code. 1513 01:00:44,670 --> 01:00:49,160 But you can also do it with a loop that goes diagonally. 1514 01:00:49,160 --> 01:00:49,260 AUDIENCE: [INTERPOSING VOICES] 1515 01:00:49,260 --> 01:00:49,390 PROFESSOR: Yes. 1516 01:00:49,390 --> 01:00:50,460 I'm going to get that. 1517 01:00:50,460 --> 01:00:52,690 That's next. 1518 01:00:52,690 --> 01:00:53,770 AUDIENCE: Sorry. 1519 01:00:53,770 --> 01:00:54,540 PROFESSOR: That's OK. 1520 01:00:54,540 --> 01:01:01,080 So the reason that I'm showing that is because this class is 1521 01:01:01,080 --> 01:01:05,210 not just about how to make the cores exactly run faster. 1522 01:01:05,210 --> 01:01:07,630 Think about algorithmic issues and stuff like that. 1523 01:01:07,630 --> 01:01:10,670 So sometimes, when you look at a problem, it looks crazy. 1524 01:01:10,670 --> 01:01:13,240 And there might be some changes you can do that you 1525 01:01:13,240 --> 01:01:16,590 can get to run things in parallel. 1526 01:01:16,590 --> 01:01:18,890 So I'm actually doing not diagonal. 1527 01:01:18,890 --> 01:01:21,030 I'm actually doing something very simple. 1528 01:01:21,030 --> 01:01:26,120 So what I have done here is I have all these 1529 01:01:26,120 --> 01:01:27,120 dependences in here. 1530 01:01:27,120 --> 01:01:27,660 OK? 1531 01:01:27,660 --> 01:01:34,380 So the problem here is I can't find a single [UNINTELLIGIBLE] 1532 01:01:34,380 --> 01:01:37,390 that basically has no crossing. 1533 01:01:37,390 --> 01:01:39,500 But if you look at this [UNINTELLIGIBLE] 1534 01:01:39,500 --> 01:01:41,140 diagonal here. 1535 01:01:41,140 --> 01:01:44,780 What you see is, in fact, there's nothing that crosses 1536 01:01:44,780 --> 01:01:46,030 the diagonal. 1537 01:01:46,030 --> 01:01:48,370 1538 01:01:48,370 --> 01:01:48,880 OK? 1539 01:01:48,880 --> 01:01:51,600 So this one basically doesn't depend on 1540 01:01:51,600 --> 01:01:52,790 this one or this one. 1541 01:01:52,790 --> 01:01:54,480 It only depends on the previous one. 1542 01:01:54,480 --> 01:01:57,230 So I can run everything in the diagonal parallel in 1543 01:01:57,230 --> 01:01:58,560 here in this one. 1544 01:01:58,560 --> 01:02:01,247 So of course, I can't write anything [UNINTELLIGIBLE] in 1545 01:02:01,247 --> 01:02:03,260 here, but there's a cute trick you can do. 1546 01:02:03,260 --> 01:02:06,400 What you can do is you can take iteration 1547 01:02:06,400 --> 01:02:07,910 space and skew it. 1548 01:02:07,910 --> 01:02:10,620 1549 01:02:10,620 --> 01:02:15,300 So what I have done is now instead off the same thing, 1550 01:02:15,300 --> 01:02:17,851 instead of this being a square, now I skewed it a 1551 01:02:17,851 --> 01:02:20,250 little bit. 1552 01:02:20,250 --> 01:02:20,790 OK? 1553 01:02:20,790 --> 01:02:28,730 So what that means is when I'm running first i, I basically 1554 01:02:28,730 --> 01:02:29,530 don't run any here. 1555 01:02:29,530 --> 01:02:32,380 Then, I run this one and this iteration here. 1556 01:02:32,380 --> 01:02:34,220 So what I have done is I have kind of moved my iteration 1557 01:02:34,220 --> 01:02:35,090 space around. 1558 01:02:35,090 --> 01:02:38,810 Do you see how this might be? 1559 01:02:38,810 --> 01:02:42,900 So now, the interesting thing is when I skew, if I look at 1560 01:02:42,900 --> 01:02:56,635 this line, I can parallelize in this one because all the 1561 01:02:56,635 --> 01:02:58,490 dependences come from the previous iteration. 1562 01:02:58,490 --> 01:02:59,150 Am I right? 1563 01:02:59,150 --> 01:03:00,400 [UNINTELLIGIBLE] 1564 01:03:00,400 --> 01:03:03,532 1565 01:03:03,532 --> 01:03:04,000 Yeah. 1566 01:03:04,000 --> 01:03:05,250 I skewed it. 1567 01:03:05,250 --> 01:03:10,864 1568 01:03:10,864 --> 01:03:16,440 Yes, everything in here, these ones are parallel. 1569 01:03:16,440 --> 01:03:16,990 OK? 1570 01:03:16,990 --> 01:03:21,070 And any dependence comes from the previous iteration. 1571 01:03:21,070 --> 01:03:22,670 There's no current iteration in here. 1572 01:03:22,670 --> 01:03:24,580 Everything in this one is parallel. 1573 01:03:24,580 --> 01:03:26,960 So I can parallelize this. 1574 01:03:26,960 --> 01:03:29,020 So this one doesn't depend on this one or this one. 1575 01:03:29,020 --> 01:03:32,000 So this is all parallel. 1576 01:03:32,000 --> 01:03:33,570 This is a little bit more complicated. 1577 01:03:33,570 --> 01:03:36,990 So if you're interested to go deep, just go 1578 01:03:36,990 --> 01:03:38,005 stare at the slides. 1579 01:03:38,005 --> 01:03:40,460 I have the slides out there to understand how that happens. 1580 01:03:40,460 --> 01:03:43,170 So if you think about what I'm running here in parallel is 1581 01:03:43,170 --> 01:03:47,660 the one basically this diagonal in here. 1582 01:03:47,660 --> 01:03:50,870 So what happens is if you run this, this, and this parallel, 1583 01:03:50,870 --> 01:03:51,690 there's no dependence. 1584 01:03:51,690 --> 01:03:54,600 I don't need this one or this one to run this one. 1585 01:03:54,600 --> 01:03:56,550 So I can run this, this, this, this, all 1586 01:03:56,550 --> 01:03:58,010 this diagonal in parallel. 1587 01:03:58,010 --> 01:03:59,680 But the trouble with just the diagonal is I don't have a 1588 01:03:59,680 --> 01:04:01,850 place in here to say [UNINTELLIGIBLE] 1589 01:04:01,850 --> 01:04:02,560 for a diagonal. 1590 01:04:02,560 --> 01:04:04,940 So I basically skewed it and then made a 1591 01:04:04,940 --> 01:04:06,760 diagonal into one loop. 1592 01:04:06,760 --> 01:04:12,780 So then, now what happens is basically j 1593 01:04:12,780 --> 01:04:15,850 loop I can run parallel. 1594 01:04:15,850 --> 01:04:17,800 This one. 1595 01:04:17,800 --> 01:04:20,450 So I can do it four [UNINTELLIGIBLE] four. 1596 01:04:20,450 --> 01:04:21,070 OK? 1597 01:04:21,070 --> 01:04:27,910 So here's something you found a problem that has no nice 1598 01:04:27,910 --> 01:04:28,200 parallelism. 1599 01:04:28,200 --> 01:04:31,370 But you realize there's kind of a what you call a wavefront 1600 01:04:31,370 --> 01:04:32,550 going on here. 1601 01:04:32,550 --> 01:04:33,530 Wave going on here. 1602 01:04:33,530 --> 01:04:36,870 So not the given dimension, but there's another dimension 1603 01:04:36,870 --> 01:04:37,690 that you can parallel. 1604 01:04:37,690 --> 01:04:41,090 So you kind of skewed your space to get that nice 1605 01:04:41,090 --> 01:04:41,690 [UNINTELLIGIBLE] line. 1606 01:04:41,690 --> 01:04:44,210 And you run parallel. 1607 01:04:44,210 --> 01:04:47,100 So that's all I have for today. 1608 01:04:47,100 --> 01:04:48,350